Interspeech 2018
DOI: 10.21437/interspeech.2018-2466

Deep Neural Networks for Emotion Recognition Combining Audio and Transcripts

Abstract: In this paper, we propose to improve emotion recognition by combining acoustic information and conversation transcripts. On the one hand, an LSTM network was used to detect emotion from acoustic features such as F0, shimmer, jitter, MFCC, etc. On the other hand, a multi-resolution CNN was used to detect emotion from word sequences. This CNN consists of several parallel convolutions with different kernel sizes to exploit contextual information at different levels. A temporal pooling layer aggregates the hidden representations…
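
The abstract gives enough detail to sketch both branches. The PyTorch sketch below is our illustration of that description, not the authors' released code; every hyperparameter (feature dimension, hidden size, kernel widths, filter counts) is an assumption:

```python
import torch
import torch.nn as nn

class AcousticLSTM(nn.Module):
    """Acoustic branch: an LSTM over frame-level features such as F0,
    jitter, shimmer and MFCCs. All sizes here are illustrative."""
    def __init__(self, n_feats=40, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)

    def forward(self, frames):                  # frames: (batch, T, n_feats)
        _, (h_n, _) = self.lstm(frames)
        return h_n[-1]                          # last hidden state: (batch, hidden)

class MultiResolutionTextCNN(nn.Module):
    """Text branch: parallel 1-D convolutions with different kernel sizes
    over word embeddings, aggregated by temporal max-pooling."""
    def __init__(self, vocab_size=10000, embed_dim=300,
                 n_filters=64, kernel_sizes=(2, 3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k, padding=k // 2)
            for k in kernel_sizes)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Each kernel size captures context at a different resolution;
        # max over time pools each branch into a fixed-size vector.
        pooled = [conv(x).relu().amax(dim=2) for conv in self.convs]
        return torch.cat(pooled, dim=1)         # (batch, n_filters * len(kernel_sizes))
```

The two branch outputs would then be combined and fed to a shared classifier over the emotion classes.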

Cited by 72 publications (46 citation statements)
References 14 publications (26 reference statements)
“…To measure the performance of systems, we report the weighted accuracy (WA) and unweighted accuracy (UA) averaged over the 10-fold cross-validation experiments. We use the same dataset and features as other researchers [7,18]. Table 1 presents the performance of the proposed approaches for recognizing speech emotion in comparison with various models.…”
Section: Performance Evaluation (mentioning)
confidence: 99%
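
For readers unfamiliar with these two metrics in the speech-emotion literature: WA weights every utterance equally, while UA is the mean of per-class recalls, so rare emotion classes count as much as frequent ones. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def wa_ua(y_true, y_pred, n_classes):
    """WA: fraction of all utterances classified correctly.
    UA: unweighted mean of per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    recalls = [float(np.mean(y_pred[y_true == c] == c))
               for c in range(n_classes) if np.any(y_true == c)]
    ua = float(np.mean(recalls))
    return wa, ua
```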
“…In [19], emotional keywords are exploited to effectively identify the classes. Recently, in [9,10,20], a long short-term memory (LSTM)-based network has been explored to encode the information of both modalities. Furthermore, there have been some attempts to fuse the modalities using the inter-attention mechanism [11,12].…”
Section: Recent Work (mentioning)
confidence: 99%
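
The inter-attention fusion referenced in [11,12] can be pictured as cross-attention between the two modalities' hidden sequences. The sketch below is our paraphrase of that idea, not the cited papers' exact architecture; all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class InterAttentionFusion(nn.Module):
    """Cross-modal attention: text steps attend over audio steps;
    the symmetric audio-to-text direction would be analogous."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, text_seq, audio_seq):
        # text_seq: (batch, T_text, d_model); audio_seq: (batch, T_audio, d_model)
        attended, _ = self.attn(query=text_seq, key=audio_seq, value=audio_seq)
        fused = torch.cat([text_seq, attended], dim=-1)  # (batch, T_text, 2*d_model)
        return self.proj(fused).mean(dim=1)              # utterance-level fused vector
```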
“…A total of 10 unique speakers participated in this work. Following previous research [9,10,12], we assign a single categorical emotion to each utterance on which a majority of annotators agreed. The final dataset contains 7,487 utterances in total (1,103 angry, 1,041 excited, 595 happy, 1,084 sad, 1,849 frustrated, 107 surprised and 1,708 neutral).…”
Section: Dataset and Experimental Setup (mentioning)
confidence: 99%
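
The majority-agreement filtering described in that statement can be sketched as follows (our illustration; the cited papers' exact agreement criterion may differ):

```python
from collections import Counter

def majority_emotion(annotations):
    """Keep an utterance only when a strict majority of annotators agreed
    on one label; otherwise return None so the utterance is discarded."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes > len(annotations) / 2 else None

# majority_emotion(["sad", "sad", "frustrated"])   -> "sad"
# majority_emotion(["sad", "happy", "frustrated"]) -> None
```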
“…In [1,2,3], feature learning from the raw waveform or spectrogram using CNN- and LSTM-based models is explored. In [4,5,6,7], CNN- and LSTM-based models are explored on feature representations such as MFCC and OpenSMILE [8] features. In [9,10,11,12], the adversarial learning paradigm is explored for robust recognition.…”
Section: Introduction (mentioning)
confidence: 99%