2006
DOI: 10.1109/tasl.2006.876121

An objective and subjective study of the role of semantics and prosodic features in building corpora for emotional TTS

Abstract: Building a text corpus suitable for use in corpus-based speech synthesis is a time-consuming process that usually requires some human intervention to select the desired phonetic content and the necessary variety of prosodic contexts. If an emotional text-to-speech (TTS) system is desired, the complexity of the corpus generation process increases. This paper presents a study aiming to validate or reject the use of a semantically neutral text corpus for the recording of both neutral and emotional (act…


Cited by 70 publications (25 citation statements)
References 13 publications
“…Specifically, they tried to detect a driver's emotional status including stress level, disappointment, and euphoria, using biological signals such as facial electromyograms, electrocardiogram, respiration, and electrodermal activity. Customer service call centers equipped with speech recognition technology can assign rhythms that suit each emotion and can even generate appropriate sound signals suitable for the users' emotions (Navas, Hernaez, & Iker, 2006). As such, emotion detection technologies will heighten the quality of human-machine interaction, and users will be able to interface with their computers in a more enjoyable and useful manner.…”
mentioning
confidence: 99%
“…The possible answers were the 5 styles of the corpus (see Section 2.1.2) plus the additional option of Don't know/Another (Dk/A) to avoid biasing the results in the case of confusion or doubts between two options. The risk of adding this option is that some evaluators may use it excessively to accelerate the test (Navas et al, 2006). However, this effect was negligible in this test (see right column of Table 3).…”
Section: Test Design
mentioning
confidence: 94%
“…Montero et al (1998);Hozjan et al (2002); Navas et al (2006); Morrison et al (2007)). When a large speech corpus is compiled, we must make sure that all the utterances are consistent with the expressive category definition.…”
Section: Subjective Evaluation
mentioning
confidence: 99%
“…The speaker-dependent approach gives much better results than the speaker-independent approach, as shown by the excellent results of Navas et al [29], where about 98% accuracy was achieved by using the Gaussian mixture model (GMM) as a classifier, with prosodic, voice quality as well as Mel frequency cepstral coefficient (MFCC) employed as speech features. However, the speaker-dependent approach is not feasible in many applications that deal with a very large number of possible users (speakers).…”
Section: Methods
mentioning
confidence: 99%