Emotional speech synthesis: from speech database to TTS

Montero, Juan Manuel; Gutiérrez-Arriola, Juana M.; Palazuelos, Sira E.; Enríquez, Emilia; Aguilera, Santiago; Pardo, José Manuel

doi:10.21437/icslp.1998-147

Cited by 55 publications

(6 citation statements)

References 6 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…Over the last several decades, there has been an emphasis on emotion representation in speech. Montero et al [ 22 ] put forward that synthesised speech cannot be marked as natural sounding in the absence of emotional features. Li and Zhao [ 23 ] used acoustic features to identify emotions in speech.…”

Section: Literature Reviewmentioning

confidence: 99%

Speech Emotion Recognition Using Attention Model

Singh

Saheer

Faust

2023

IJERPH

View full text Add to dashboard Cite

Speech emotion recognition is an important research topic that can help to maintain and improve public health and contribute towards the ongoing progress of healthcare technology. There have been several advancements in the field of speech emotion recognition systems including the use of deep learning models and new acoustic and temporal features. This paper proposes a self-attention-based deep learning model that was created by combining a two-dimensional Convolutional Neural Network (CNN) and a long short-term memory (LSTM) network. This research builds on the existing literature to identify the best-performing features for this task with extensive experiments on different combinations of spectral and rhythmic information. Mel Frequency Cepstral Coefficients (MFCCs) emerged as the best performing features for this task. The experiments were performed on a customised dataset that was developed as a combination of RAVDESS, SAVEE, and TESS datasets. Eight states of emotions (happy, sad, angry, surprise, disgust, calm, fearful, and neutral) were detected. The proposed attention-based deep learning model achieved an average test accuracy rate of 90%, which is a substantial improvement over established models. Hence, this emotion detection model has the potential to improve automated mental health monitoring.

show abstract

Section: Literature Reviewmentioning

confidence: 99%

Speech Emotion Recognition Using Attention Model

Singh

Saheer

Faust

2023

IJERPH

View full text Add to dashboard Cite

show abstract

“…El producto de esta interacción hablante humano-ChatGPT, la «lengua-e» (Mendívil 2023) en tanto que externa, puede analizarse con las herramientas que nos proporcionan disciplinas lingüísticas como el análisis del discurso, la lingüística textual o el análisis de la conversación. Contamos ya con veinticinco años de investigaciones que analizan, desde el campo de la lingüística, las interacciones humano-máquina, como la de Lee y Narayanan (2005), que analizan las conversaciones con un call center, las de Montero et al (1998a;1998b) para el discurso emotivo, o Přibil y Přibilová (2010), que recurren a actores para grabar párrafos, frases cortas y palabras aisladas con el objetivo de aplicar los resultados al discurso robotizado (Text-To-Speech System). Igualmente, las investigaciones de Fischer (2023) y Fischer y Matsumoto (2023), sobre las interacciones con robots sociales o sobre el discurso persuasivo de estos (Fischer, Fischer y Palinko 2023;Fucinato, Niebuhr, Nørskov y Fischer 2023;Langedijk y Fischer 2023).…”

Section: La Inteligencia Artificial Y Los Modelos De Lenguajeunclassified

El hablar y los participantes en la interacción comunicativa: cuando el interlocutor es artificial

Domínguez García

2023

Biblioteca de Babel

View full text Add to dashboard Cite

La proliferación, en los últimos años, de los modelos de lenguaje impulsados por las investigaciones en inteligencia artificial ha hecho que lingüistas de todas las disciplinas se interesen por estudiar las estrategias de estos modelos para generar un lenguaje que imita al de los seres humanos. El objeto de este trabajo es el estudio de la interacción con uno de estos modelos de lenguaje artificial, en concreto de ChatGPT-3. Para ello, se ha analizado un corpus de seis textos que contienen interacciones completas entre un interlocutor humano y este modelo de lenguaje. Se han segmentado los textos tomando como unidad básica de análisis la intervención, en el nivel monologal, y el intercambio, en el nivel dialogal (Briz 2000; Grupo Val.Es.Co. 2014; Pons 2022). También se ha analizado su superestructura, tomando como unidad base la secuencia (Werlich 1975; Adam 1992; Roulet 1991) y, así, poder determinar cuáles son los tipos de texto que pueden observarse en el corpus de referencia

show abstract

“…The average recognition rates achieved were 92.5% and 90% for DBN and BP, respectively. In addition, in [32], restricted Boltzmann machines (RBMs) and DBN were used together with audio files from one female Spanish speaker from the emotional speech dataset [33] as part of large project, INTERFACE, with the big six classes, joy, sadness, anger, fear, disgust and surprise along with neutral. Those authors used two kinds of features for extraction, MFCC and prosodic features with RBM and DBN, providing a maximum classification error rate of 40.82%…”

Section: Classification Based On Six Classesmentioning

confidence: 99%

Efficient Feature-Aware Hybrid Model of Deep Learning Architectures for Speech Emotion Recognition

et al. 2021

View full text Add to dashboard Cite

Robust automatic speech emotional-speech recognition architectures based on hybrid convolutional neural networks (CNN) and feedforward deep neural networks are proposed and named in this paper as: BFN, CNA, and HBN. BFN is a combination between bag-of-Audio-word (BoAW) and feedforward deep neural network, CNA based on CNN, finally, HBN is hybrid architecture between BFN and CNA. Overall accuracy is achieved by leveraging Mel-frequency cepstral coefficient features and bag-of-acoustic-words to feed the network, resulting in promising classification performance. In addition, the concatenated output from the proposed hybrid networks is fed into a softmax layer to produce a probability distribution over categorical classifications for speech recognition. The three proposed models are trained on eight emotional classes from the Ryerson Audio-Visual Database of Emotional Speech and Song audio (RAVDESS) dataset. Our proposed models achieved overall precision between 81.5% and 85.5% and overall accuracy between 80.6% and 84.5%, hence outperforming state-of-the-art models using the same dataset.INDEX TERMS Bag-of-acoustic-words, convolutional neural network, feedforward deep neural network, hybrid features, Mel frequency cepstral coefficients, support vector machine. HESHAM F. A. HAMED received the B.Sc. degree in electrical engineering and the M.Sc. and Ph.D. degrees in electronics and communi-

show abstract

Emotional speech synthesis: from speech database to TTS

Cited by 55 publications

References 6 publications

Speech Emotion Recognition Using Attention Model

Speech Emotion Recognition Using Attention Model

El hablar y los participantes en la interacción comunicativa: cuando el interlocutor es artificial

Efficient Feature-Aware Hybrid Model of Deep Learning Architectures for Speech Emotion Recognition

Contact Info

Product

Resources

About