ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054229

Attentive Modality Hopping Mechanism for Speech Emotion Recognition

Abstract: In this work, we explore the impact of visual modality in addition to speech and text for improving the accuracy of the emotion detection system. The traditional approaches tackle this task by fusing the knowledge from the various modalities independently for performing emotion classification. In contrast to these approaches, we tackle the problem by introducing an attention mechanism to combine the information. In this regard, we first apply a neural network to obtain hidden representations of the modalities.…
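As a rough illustration only (not the paper's exact attentive-modality-hopping formulation, which the truncated abstract does not fully specify), the sketch below uses plain NumPy and made-up utterance-level vectors to show the general idea: one modality attends over the other two, and the attended summary becomes the query for the next hop.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention: weight each key vector by its similarity to the query."""
    scores = softmax(keys @ query)   # (num_keys,)
    return scores @ keys             # weighted sum over keys, shape (dim,)

def modality_hop(audio, text, video, num_hops=3):
    """Illustrative 'hopping': on each hop, attend over the two modalities other than
    the current one and carry the attended summary forward as the next query."""
    modalities = [audio, text, video]
    query = audio                    # hypothetical choice of starting modality
    for hop in range(num_hops):
        others = np.stack([m for i, m in enumerate(modalities) if i != hop % 3])
        query = attend(query, others)
    return query

# Toy 8-dimensional utterance-level representations for each modality.
rng = np.random.default_rng(0)
audio, text, video = rng.normal(size=(3, 8))
fused = modality_hop(audio, text, video)
print(fused.shape)  # (8,)
```

The fused vector would then feed an emotion classifier; the dimensionality, number of hops, and starting modality here are arbitrary choices for the sketch.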

Cited by 28 publications (27 citation statements)
References 27 publications
“…MCSAN outperforms the state-of-the-art MHA model by 6.9%. We also present the performance of AMH [20] in Table 1. AMH is a tri-modal version of MHA by incorporating the visual information into MHA's framework.…”
Section: Comparison To State-of-the-art Methods
confidence: 99%
“…To align with previous studies [20], we use 7,487 utterances from seven emotions: frustration, neutral, anger, sadness, excitement, happiness, surprise. Since there is no standard split for this dataset, we follow [20,14] to perform 10-fold cross-validation, where 8:1:1 are used for training, validation and test, respectively. The weighted accuracy (WA, i.e., the overall accuracy) and unweighted accuracy (UA, i.e., the average accuracy over all emotion categories) is adopted as the evaluation metrics.…”
Section: Datasets
confidence: 99%
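The weighted accuracy (WA) and unweighted accuracy (UA) described in that excerpt can be computed as in the sketch below; the label arrays are hypothetical and merely stand in for fold-level predictions over the seven emotion classes.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """WA: overall fraction of correctly classified utterances."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

def unweighted_accuracy(y_true, y_pred):
    """UA: recall averaged over emotion classes, so rare classes count equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy labels over seven emotion classes (0..6).
y_true = [0, 1, 1, 2, 3, 4, 5, 6, 6, 6]
y_pred = [0, 1, 2, 2, 3, 4, 5, 6, 0, 6]
print(weighted_accuracy(y_true, y_pred))    # 0.8
print(unweighted_accuracy(y_true, y_pred))  # per-class average recall
```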
“…To obtain more detailed information, more attention would be invested to the target area. In the meantime, it suppresses other useless information [55][56][57][58][59][60].…”
Section: A. CBAM: Attention-Based Conv-BiLSTM
confidence: 99%
“…For example, a driver emotion detection system can automatically infer the driver’s emotional state and take corresponding measures to ensure road safety and human health [19]. Many previous research efforts on emotion recognition process the information from different modalities and use multimodal clues to infer the emotional states, and they show improvement of the overall performance by multimodal fusion [20]. Tzirakis et al applied convolutional neural networks to extract features from the speech and the facial expression and utilized long short-term memory networks to model the context and improve the performance [21].…”
Section: Related Work
confidence: 99%
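As a hedged sketch of the CNN-plus-LSTM pattern that the excerpt attributes to Tzirakis et al. (convolutional feature extraction followed by recurrent context modelling), the PyTorch module below is illustrative only; the layer sizes and the single-modality spectrogram input are assumptions, not the cited architecture.

```python
import torch
import torch.nn as nn

class CnnLstmEmotion(nn.Module):
    """Hypothetical sketch: 1-D convolutions extract frame-level features,
    an LSTM summarizes temporal context, a linear head predicts the emotion."""
    def __init__(self, n_mels=40, hidden=64, n_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, n_mels, frames)
        h = self.conv(x).transpose(1, 2)   # (batch, frames, 64)
        _, (h_n, _) = self.lstm(h)         # last hidden state summarizes context
        return self.head(h_n[-1])          # (batch, n_classes)

logits = CnnLstmEmotion()(torch.randn(2, 40, 100))
print(logits.shape)  # torch.Size([2, 7])
```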