Interspeech 2020
DOI: 10.21437/interspeech.2020-2312
Multimodal Speech Emotion Recognition Using Cross Attention with Aligned Audio and Text

Cited by 12 publications (6 citation statements)
References 7 publications
“…To align with previous studies [20], we use 7,487 utterances covering seven emotions: frustration, neutral, anger, sadness, excitement, happiness, and surprise. Since there is no standard split for this dataset, we follow [20, 14] and perform 10-fold cross-validation, where an 8:1:1 split is used for training, validation, and testing, respectively. The weighted accuracy (WA, i.e., the overall accuracy) and the unweighted accuracy (UA, i.e., the average accuracy over all emotion categories) are adopted as the evaluation metrics.…”
Section: Datasets
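The two metrics quoted above are simple to compute from model predictions: WA is plain overall accuracy, while UA is the macro-average of per-class recalls. A minimal NumPy sketch (function and variable names are ours, not from the cited work):

```python
import numpy as np

def weighted_and_unweighted_accuracy(y_true, y_pred, num_classes):
    """WA: overall accuracy over all utterances.
    UA: average of per-class recalls (macro recall)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))   # weighted accuracy
    recalls = []
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():                      # skip classes absent from y_true
            recalls.append(np.mean(y_pred[mask] == c))
    ua = float(np.mean(recalls))            # unweighted accuracy
    return wa, ua

# Toy example with 3 emotion classes:
print(weighted_and_unweighted_accuracy([0, 0, 1, 2], [0, 1, 1, 2], 3))
# -> (0.75, 0.833...)
```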
“…4. CAN [14] aggregates the sequential information from the aligned audio and text by applying the attention weights of each modality in both a normal and a crossed way.…”
Section: Baselines
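To make the quoted "normal and crossed" aggregation concrete, the sketch below pools each modality's time-aligned sequence twice: once with attention weights computed from its own features and once with weights computed from the other modality. This illustrates the general mechanism only; the class name, layer choices, and dimensions are our assumptions, not the CAN authors' exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionPooling(nn.Module):
    """Sketch of normal + crossed attention pooling over time-aligned
    audio/text sequences. Dimensions and layers are illustrative."""

    def __init__(self, dim):
        super().__init__()
        self.score_audio = nn.Linear(dim, 1)  # scores from audio frames
        self.score_text = nn.Linear(dim, 1)   # scores from text tokens

    def forward(self, audio, text):
        # audio, text: (batch, T, dim); alignment means both share length T
        a_w = torch.softmax(self.score_audio(audio), dim=1)  # (B, T, 1)
        t_w = torch.softmax(self.score_text(text), dim=1)    # (B, T, 1)
        audio_normal = (a_w * audio).sum(dim=1)   # audio pooled by its own weights
        text_normal = (t_w * text).sum(dim=1)     # text pooled by its own weights
        audio_crossed = (t_w * audio).sum(dim=1)  # audio pooled by text weights
        text_crossed = (a_w * text).sum(dim=1)    # text pooled by audio weights
        return torch.cat([audio_normal, text_normal,
                          audio_crossed, text_crossed], dim=-1)

# Usage: fused = CrossAttentionPooling(128)(torch.randn(2, 50, 128),
#                                           torch.randn(2, 50, 128))
```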
“…Then, the connection between features and annotations is characterized by SVR, with its scaling parameter chosen from {0.01·n_F, 0.1·n_F, n_F, 10·n_F} and the same choices of regularization parameter as in the SVMs. In addition, neural networks are employed, considering 12 selections of hidden-layer neurons: (32, 8), (32, 16), (64, 16), …”
Section: A. Experimental Preparation, 1) Corpus and Features
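The quoted hyper-parameter grid maps naturally onto a standard grid search. The sketch below assumes scikit-learn's SVR, interprets the "scaling parameter" as the RBF gamma (our assumption, not stated in the quote), and uses a placeholder regularization grid, since the SVM values it mirrors are not quoted in this excerpt; the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for utterance-level features and continuous annotations.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 88))   # n_F = 88 features per utterance (illustrative)
y = rng.normal(size=200)

n_F = X.shape[1]
param_grid = {
    # Scaling parameter searched over {0.01*n_F, 0.1*n_F, n_F, 10*n_F};
    # mapping it to the RBF gamma is our assumption.
    "gamma": [0.01 * n_F, 0.1 * n_F, 1.0 * n_F, 10.0 * n_F],
    # Placeholder regularization grid; the excerpt says the SVR reuses the
    # SVM grid but does not quote the actual values.
    "C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```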
“…The past two decades have witnessed rapid progress in the paralinguistics of auditory affective computing [1]-[3], encompassing emotion recognition in speech [4]-[6], music [7], and multimodal conditions [8]. Typically, in speech emotion recognition (SER) research, machines learn to perceive emotional information in speech through learning procedures configured under certain settings [5].…”