Abstract-Building a text corpus suitable to be used in corpusbased speech synthesis is a time-consuming process that usually requires some human intervention to select the desired phonetic content and the necessary variety of prosodic contexts. If an emotional text-to-speech (TTS) system is desired, the complexity of the corpus generation process increases. This paper presents a study aiming to validate or reject the use of a semantically neutral text corpus for the recording of both neutral and emotional (acted) speech. The use of this kind of texts would eliminate the need to include semantically emotional texts into the corpus. The study has been performed for Basque language. It has been made by performing subjective and objective comparisons between the prosodic characteristics of recorded emotional speech using both semantically neutral and emotional texts. At the same time, the performed experiments allow for an evaluation of the capability of prosody to carry emotional information in Basque language. Prosody manipulation is the most common processing tool used in concatenative TTS. Experiments of automatic recognition of the emotions considered in this paper (the "Big Six emotions") show that prosody is an important emotional indicator, but cannot be the only manipulated parameter in an emotional TTS system-at least not for all the emotions. Resynthesis experiments transferring prosody from emotional to neutral speech have also been performed. They corroborate the results and support the use of a neutral-semantic-content text in databases for emotional speech synthesis.Index Terms-Evaluation of expressivity, prosody analysis, speech corpus design.
This paper presents the experiments made to automatically identify emotion in an emotional speech database for Basque. Three different classifiers have been built: one using spectral features and GMM, other with prosodic features and SVM and the last one with prosodic features and GMM. 86 prosodic features were calculated and then an algorithm to select the most relevant ones was applied. The first classifier gives the best result with a 98.4% accuracy when using 512 mixtures, but the classifier built with the best 6 prosodic features achieves an accuracy of 92.3% in spite of its simplicity, showing that prosodic information is very useful to identify emotions.
This paper presents the acoustical study of an emotional speech database in standard Basque to determine the set of parameters that can be used for the recognition of emotions. The database is divided into two parts, one with neutral texts and another one with texts semantically related with the emotion. The study is performed on both parts, in order to known whether the same criteria may be used to recognize emotions independently of the semantic content of the text. Mean F0, F0 range, maximum positive slope in F0 curve, mean phone duration and RMS energy are analyzed. The parameters selected can distinguish emotions in both corpora, so they are suitable for emotion recognition.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.