2021
DOI: 10.1109/taslp.2021.3076364

Speech Emotion Recognition Considering Nonverbal Vocalization in Affective Conversations

Cited by 40 publications (11 citation statements)
References 41 publications

“…Out of the selected studies, 39/51 (76.5%) utilized English language datasets with IEMOCAP being used in 30/39 of those cases. 6 of the remaining studies employed 6 databases in Chinese [28][29][30][31][32][33], 2 databases each in French [34,35] and Dutch [36,37], and 1 data set each for Bengali [38], Greek [39], Malay [40], Indonesian [41], and Hungarian [42]. Two studies did not specify the spoken language of the used datasets [43,44].…”
Section: Characteristics of the Included Studies
Confidence: 99%

“…OpenSMILE-based features [32,50,53]); (b) deep-learned features extracted from the raw waveform or image by means of DL (e.g. ResNet18 [32]) or pre-trained, transfer-learned feature extractors (e.g. Wav2vec [54]), here accounting for 25.5% (13/51) of total studies; (c) image transformations, summing to 19.6% (10/51), as yielded by advanced signal-processing methods applied to raw waveforms, such as spectrograms [48,55] or Mel-Frequency Cepstral Coefficients (MFCCs) [47,56]; (d) hybrid approaches, as combinations of two or three of the aforementioned options, here appearing in 25.5% (13/51) of study items.…”
Section: Characteristics of the Included Studies
Confidence: 99%

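The feature families named in that excerpt can be illustrated with a short, hedged sketch. The snippet below is not taken from any of the cited studies; it only shows, under assumed parameter values and an assumed input file name, how a log-mel spectrogram (one of the "image transformations") and MFCCs might be extracted with librosa.

```python
# Illustrative sketch only: the file name, sampling rate, and frame settings
# are assumptions, not values reported by the cited studies.
import librosa

wav, sr = librosa.load("utterance.wav", sr=16000)  # assumed 16 kHz mono recording

# (c)-style image transformation: log-mel spectrogram of the raw waveform
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=400, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)                 # shape: (64, n_frames)

# MFCCs computed from the same signal
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)  # shape: (13, n_frames)

print(log_mel.shape, mfcc.shape)
```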
“…The DB consists of both discrete and continuous emotion annotations carried out by 49 annotators consisting of students and professors. The DB has been used for emotion recognition using only speech by considering the nonverbal vocalization in conversations depicting emotions [122]. They used only the audio portion of the DB and used an LSTM to acquire the shifts in the dialogue of the speaker's emotion from a sequence of segmented speech signals.…”
Section: NNIME
Confidence: 99%

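As a rough illustration of the idea described above, an LSTM operating on a sequence of segmented speech features to follow emotion shifts across a dialogue, here is a minimal PyTorch sketch. It is not the authors' model; the feature dimension, hidden size, and number of emotion classes are illustrative assumptions.

```python
# Minimal sketch, assuming segment-level acoustic feature vectors are already
# available for each speech segment in a conversation. Dimensions are made up.
import torch
import torch.nn as nn

class DialogueEmotionLSTM(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=256, num_emotions=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, segment_feats):
        # segment_feats: (batch, num_segments, feat_dim)
        outputs, _ = self.lstm(segment_feats)   # (batch, num_segments, hidden_dim)
        return self.classifier(outputs)         # per-segment emotion logits

model = DialogueEmotionLSTM()
dummy = torch.randn(2, 10, 128)                 # 2 conversations, 10 segments each
logits = model(dummy)                           # shape: (2, 10, 4)
```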
“…A multi-classifier emotion recognition model based on prosodic information and semantic labels is introduced in [5]. Similarly, the semantic labels and the non-verbal audio in speech, such as onomatopoeia like crying, laughter, or sighing, are used in SER [6]. Subsequently, temporal and semantic coherence is introduced for SER [7].…”
Section: Introduction
Confidence: 99%