Speech Synthesis Evaluation — State-of-the-Art Assessment and Suggestion for a Novel Research Program

Wagner, Petra; Beskow, Jonas; Betz, Simon; Edlund, Jens; Gustafson, Joakim; Henter, Gustav Eje; Maguer, Sébastien Le; Malisz, Zofia; Székely, Éva; Tånnander, Christina; Voße, Jana

doi:10.21437/ssw.2019-19

Cited by 55 publications

(44 citation statements)

References 27 publications

“…Even though recent innovations have been leading synthesis systems into producing human-like speech, the flexibility required to render natural human-like speech remains a difficult problem. Additionally, speech synthesis cannot be taken as a general problem with one solution fitting everyone [34]. This is the reason why synthesising expressive speech, as well as adapting the target speech to a given speaker, are the current hot challenges of the community.…”

Section: Discussion (mentioning)

confidence: 99%

Should robots have accents?

Torre

Maguer

2020

2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

Accents are vocal features that immediately tell a listener whether a speaker comes from their same place, i.e. whether they share a social group. This in-groupness is important, as people tend to prefer interacting with others who belong to their same groups. Accents also evoke attitudinal responses based on their supposed prestigious status. These accent-based perceptions might affect interactions between humans and robots. Yet, very few studies so far have investigated the effect of accented robot speakers on users' perceptions and behaviour, and none have collected users' explicit preferences on robot accents. In this paper we present results from a survey of over 500 British speakers, who indicated what accent they would like a robot to have. The biggest proportion of participants wanted a robot to have a Standard Southern British English (SSBE) accent, followed by an Irish accent. Crucially, very few people wanted a robot with their same accent, or with a machine-like voice. These explicit preferences might not turn out to predict more successful interactions, also because of the unrealistic expectations that such human-like vocal features might generate in a user. Nonetheless, it seems that people have an idea of how their artificial companions should sound, and this preference should be considered when designing them.

“…Cambre and Kulkarni [2019] highlight the social implications of designing voices for smart devices and provide a research framework for designers to utilise to help shape users' experiences. Finally, Wagner et al [2019] discuss the future of evaluating speech synthesis, suggesting a move towards HCI-focused approaches of evaluating speech in appropriate contexts with users. Here, we build upon this existing work and present three challenges for those working in different areas of expressive synthesis.…”

Section: Current Challenges and Future Directions In Expressive Synthesis (mentioning)

confidence: 99%

“…Evaluating speech is currently done using three key approaches [Wagner et al 2019]: objective assessments classifying systems with particular scores or contrasting them with other speech (e.g. through mel-cepstral distortion (MCD) ratings); subjective assessments rating speech on concepts such as intelligibility and naturalness; and behavioural assessments examining user actions like task completion time or physiological arousal.…”

Section: More and Better User Evaluation Needed (mentioning)

confidence: 99%

Building and Designing Expressive Speech Synthesis

Aylett

Clark

Cowan

et al.

2021

The Handbook on Socially Interactive Agents

"You all know the test for artificial intelligence - the Turing test. A human judge has a conversation with a human and a computer. If the judge can't tell the machine apart from the human, the machine has passed the test. I now propose a test for computer voices - the Ebert test. If a computer voice can successfully tell a joke and do the timing and delivery as well as Henny Youngman, then that's the voice I want." - Roger Ebert
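
The citation statement above names mel-cepstral distortion (MCD) as an example of an objective assessment. As a rough illustration only, the sketch below computes the commonly used form of MCD in decibels, averaged over frames. It is not taken from any of the cited papers. The function name `mel_cepstral_distortion` is hypothetical, the two inputs are assumed to be already time-aligned lists of mel-cepstral coefficient frames, and skipping coefficient 0 (the energy term) is a frequent convention that is assumed here.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Return the mean MCD in dB between two aligned sequences of frames.

    Each frame is a sequence of mel-cepstral coefficients. Assumption: the
    sequences are already time-aligned (for example by dynamic time warping)
    and coefficient 0 (energy) is excluded from the distance.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("inputs must be non-empty and have the same number of frames")

    # Constant factor of the usual definition: (10 / ln 10) * sqrt(2)
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)

    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance over coefficients 1..D
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared)

    return total / len(ref_frames)


if __name__ == "__main__":
    # Toy frames with made-up values, purely for illustration.
    reference = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
    synthetic = [[1.0, 0.4, -0.2], [0.9, 0.4, 0.0]]
    print(mel_cepstral_distortion(reference, synthetic))
```

Lower values indicate that the synthetic coefficients are closer to the reference. As the statement above points out, a score like this is only one of three kinds of evidence, alongside subjective ratings and behavioural measures.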

“…Hinterleitner (2017) recently specified five dimensions of quality for TTS systems: naturalness of voice, prosodic quality, fluency and intelligibility, absence of disturbances, and calmness. While most synthesized voices have reached a high level of intelligibility, it is the perceived quality that still requires clarification (Polkosky and Lewis, 2003; Wagner et al, 2019). Moreover, the quality of the voice was shown to be important in establishing a positive human-robot relationship even when the content of the utterances was unintelligible (McGinn and Torre, 2019).…”

Section: Auditory Perception Of Humanoids (mentioning)

confidence: 99%

The Human Takes It All: Humanlike Synthesized Voices Are Perceived as Less Eerie and More Likable. Evidence From a Subjective Ratings Study

2020

Background: The increasing involvement of social robots in human lives raises the question as to how humans perceive social robots. Little is known about human perception of synthesized voices. Aim: To investigate which synthesized voice parameters predict the speaker's eeriness and voice likability; to determine if individual listener characteristics (e.g., personality, attitude toward robots, age) influence synthesized voice evaluations; and to explore which paralinguistic features subjectively distinguish humans from robots/artificial agents. Methods: 95 adults (62 females) listened to randomly presented audio-clips of three categories: synthesized (Watson, IBM), humanoid (robot Sophia, Hanson Robotics), and human voices (five clips/category). Voices were rated on intelligibility, prosody, trustworthiness, confidence, enthusiasm, pleasantness, human-likeness, likability, and naturalness. Speakers were rated on appeal, credibility, human-likeness, and eeriness. Participants' personality traits, attitudes to robots, and demographics were obtained. Results: The human voice and human speaker characteristics received reliably higher scores on all dimensions except for eeriness. Synthesized voice ratings were positively related to participants' agreeableness and neuroticism. Females rated synthesized voices more positively on most dimensions. Surprisingly, interest in social robots and attitudes toward robots played almost no role in voice evaluation. Contrary to the expectations of an uncanny valley, when the ratings of human-likeness for both the voice and the speaker characteristics were higher, they seemed less eerie to the participants. Moreover, when the speaker's voice was more humanlike, it was more liked by the participants. This latter point was only applicable to one of the synthesized voices. Finally, pleasantness and trustworthiness of the synthesized voice predicted the likability of the speaker's voice. Qualitative content analysis identified intonation, sound, emotion, and imageability/embodiment as diagnostic features. Discussion: Humans clearly prefer human voices, but manipulating diagnostic speech features might increase acceptance of synthesized voices and thereby support human-robot interaction. There is limited evidence that human-likeness of a voice is negatively linked to the perceived eeriness of the speaker.
