Interspeech 2019
DOI: 10.21437/interspeech.2019-3201
Fusion Techniques for Utterance-Level Emotion Recognition Combining Speech and Transcripts

Abstract: In human perception and understanding, a number of different and complementary cues are adopted according to different modalities. Various emotional states in communication between humans reflect this variety of cues across modalities. Recent developments in multi-modal emotion recognition utilize deep learning techniques to achieve remarkable performance, with models based on different features suitable for text, audio and vision. This work focuses on cross-modal fusion techniques over deep learning models fo…

Cited by 49 publications (21 citation statements)
References 21 publications (29 reference statements)
“…In [19], emotional keywords are exploited to effectively identify the classes. Recently in [9,10,20], a long short-term memory (LSTM) based network has been explored to encode the information of both modalities. Furthermore, there have been some attempts to fuse the modalities using the inter-attention mechanism [11,12].…”
Section: Recent Work
confidence: 99%
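The LSTM-based bimodal encoding these citing papers describe can be sketched as follows: one LSTM per modality, with the final hidden states combined for utterance-level classification. This is a minimal illustration, assuming log-Mel audio frames and pretrained word embeddings as inputs; the dimensions, concatenation, and linear head are illustrative assumptions, not the cited papers' exact architectures.

```python
import torch
import torch.nn as nn

class BimodalLSTMEncoder(nn.Module):
    """Separate LSTMs encode the audio-frame sequence and the word-embedding
    sequence of one utterance; the final hidden states are concatenated.
    Input dimensions and the classifier head are illustrative assumptions."""

    def __init__(self, audio_dim=40, text_dim=300, hidden=128, n_classes=4):
        super().__init__()
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.text_lstm = nn.LSTM(text_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio_seq, text_seq):
        # audio_seq: (B, T_a, audio_dim); text_seq: (B, T_t, text_dim)
        _, (h_a, _) = self.audio_lstm(audio_seq)
        _, (h_t, _) = self.text_lstm(text_seq)
        fused = torch.cat([h_a[-1], h_t[-1]], dim=-1)  # (B, 2 * hidden)
        return self.classifier(fused)
```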
“…Figure 1 shows the architecture of the proposed AMH model. Previous research used multi-modal information independently, fusing information over each modality with a neural network model [9,20]. Recently, researchers have also investigated an inter-attention mechanism over the modalities [11,12].…”
Section: Proposed Attentive Modality Hopping Mechanism
confidence: 99%
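Read loosely, one "hop" of the attentive modality hopping idea uses the summary vector of one modality as the query for attention over the other modality's hidden states, then alternates. A minimal sketch of a single hop, assuming equal hidden dimensions across modalities; the published AMH model's exact scoring, gating, and number of hops may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityHop(nn.Module):
    """One attention hop: a summary of modality A re-weights the hidden
    states of modality B and returns an updated summary of B."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, states):
        # query: (B, dim) summary of modality A
        # states: (B, T, dim) hidden states of modality B
        scores = torch.bmm(states, self.proj(query).unsqueeze(-1)).squeeze(-1)  # (B, T)
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), states).squeeze(1)  # (B, dim)
```

Hopping then alternates: the audio summary attends over the text states, the resulting text summary attends back over the audio states, and so on for a fixed number of hops before classification.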
“…In order to take advantage of the linguistic content in SER, the fusion of both textual and audio information gains in popularity [32], [33], [34]. Three strategies are usually applied for multi-modal fusion: (a) at the feature level by concatenating the inputs of different modalities, (b) at the decision level with majority voting, or (c) at the model level by merging intermediate representations [7], [10], [35], [36]. More precisely, the fused model (c) concatenates the outputs of two distinct networks, one per modality, to feed the next layers [37].…”
Section: Modality Fusion
confidence: 99%
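The three strategies in this statement can be illustrated directly. A minimal sketch; the encoder modules, dimensions, and tie-break note are assumptions for illustration, not any cited paper's implementation.

```python
import torch
import torch.nn as nn

# (a) Feature-level (early) fusion: concatenate the features of both
# modalities before feeding any model.
def feature_level_fusion(audio_feats, text_feats):
    return torch.cat([audio_feats, text_feats], dim=-1)

# (b) Decision-level (late) fusion: majority vote over the hard class
# predictions of per-modality classifiers. With only two voters a
# tie-break rule (e.g. trusting the more confident modality) is needed.
def decision_level_fusion(preds):
    # preds: list of (batch,) integer class predictions, one per classifier
    return torch.stack(preds, dim=0).mode(dim=0).values  # (batch,)

# (c) Model-level fusion: concatenate intermediate representations of two
# modality-specific networks and feed the result to shared layers.
class ModelLevelFusion(nn.Module):
    def __init__(self, audio_enc, text_enc, dim, n_classes):
        super().__init__()
        self.audio_enc, self.text_enc = audio_enc, text_enc
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, n_classes))

    def forward(self, audio, text):
        z = torch.cat([self.audio_enc(audio), self.text_enc(text)], dim=-1)
        return self.head(z)
```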
“…Traditional feature-level fusion and decision-level fusion, as well as the current tensor-level fusion trend, are widely used [1]. Furthermore, many researchers have also compared the performance of various fusion methods [2]. All of these works demonstrated the utility of combining linguistic information in SER.…”
Section: Introduction
confidence: 99%
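Tensor-level fusion is commonly realized as an outer product of per-modality utterance embeddings, as in Tensor Fusion Networks; whether [1] refers to exactly this variant is an assumption. A minimal two-modality sketch:

```python
import torch

def tensor_fusion(z_a, z_t):
    """Outer-product (tensor-level) fusion of two utterance embeddings.

    Appending a constant 1 to each embedding keeps the unimodal terms
    alongside the bimodal interaction terms (an assumption based on the
    Tensor Fusion Network formulation).
    """
    B = z_a.size(0)
    ones = z_a.new_ones(B, 1)
    za = torch.cat([z_a, ones], dim=-1)  # (B, Da + 1)
    zt = torch.cat([z_t, ones], dim=-1)  # (B, Dt + 1)
    # Batched outer product, flattened for a downstream classifier.
    return torch.bmm(za.unsqueeze(2), zt.unsqueeze(1)).reshape(B, -1)
```

Unlike plain concatenation, the resulting (Da + 1)(Dt + 1)-dimensional vector exposes every pairwise interaction between audio and text features, at the cost of a much larger input to the classifier.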