M3ER: Multiplicative Multimodal Emotion Recognition using Facial, Textual, and Speech Cues

Mittal, Trisha; Bhattacharya, Uttaran; Chandra, Rohan; Bera, Aniket; Manocha, Dinesh

doi:10.1609/aaai.v34i02.5492

Cited by 180 publications

(111 citation statements)

References 24 publications

“…"A", "T", and "V" refer to the audio, text, and video modalities respectively. Results from [6] are not included since the test setting was not clear. Results from [13] and [32] were not obtained using leave-one-speaker-out 10-fold CV and thus not directly comparable.…”

Section: Use of ASR Transcriptions (mentioning)

confidence: 99%

“…Although significant progress has been made [1][2][3], AER is still a challenging research problem since human emotions are inherently complex, ambiguous, and highly personal. Humans often express their emotions using multiple simultaneous approaches, such as voice characteristics, linguistic content, facial expressions, and body actions, which makes AER by nature a complex multimodal task [4][5][6]. Furthermore, due to the difficulties in data collection, publicly available datasets often do not have enough speakers to properly cover personal variations in emotion expression.…”

Section: Introduction (mentioning)

confidence: 99%

“…For instance, various types of acoustic features can be fused with text features derived either from pre-trained word embeddings [10,11] or from a jointly trained neural network component [12,13]. Context-dependent hierarchical fusion [14,15], multi-head attention mechanisms [13], and multiplicative fusion [6] have been applied to emotion recognition.…”

Section: Introduction (mentioning)

confidence: 99%

Emotion Recognition by Fusing Time Synchronous and Time Asynchronous Representations

Zhang

Woodland

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this paper, a novel two-branch neural network model structure is proposed for multimodal emotion recognition, which consists of a time synchronous branch (TSB) and a time asynchronous branch (TAB). To capture correlations between each word and its acoustic realisation, the TSB combines speech and text modalities at each input window frame and then uses pooling across time to form a single embedding vector. The TAB, by contrast, provides cross-utterance information by integrating sentence text embeddings from a number of context utterances into another embedding vector. The final emotion classification uses both the TSB and the TAB embeddings. Experimental results on the IEMOCAP dataset demonstrate that the two-branch structure achieves state-of-the-art results in 4-way classification with all common test setups. When using automatic speech recognition (ASR) output instead of manually transcribed reference text, it is shown that the cross-utterance information considerably improves robustness against ASR errors. Furthermore, by incorporating an extra class for all the other emotions, the final 5-way classification system with ASR hypotheses can be viewed as a prototype for more realistic emotion recognition systems.

“…The studies focusing on multimodal fusion experiment on multimodal data for the purpose of accurate improvement of emotion recognition, but are short of the empirical evidence to prove the effectiveness of models when some of modalities are unavailable. Mittal et al [95] propose M3ER that utilizes Modality Check Step to replace unavailable modality with proxy feature and fuses multimodal features by multiplicative fusion module. M3ER is a promising technique but similarly lack of experiments in unimodality and bimodality.…”

Section: Unified Model (mentioning)

confidence: 99%
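
The statement above describes two steps: substituting a proxy for an unavailable modality, then fusing modalities multiplicatively. The sketch below is a toy illustration of that idea only, not the M3ER implementation. The names `fill_missing` and `multiplicative_fuse` are hypothetical, the proxy rule (mean of the available modalities) is an assumption, and the fusion shown is a plain element-wise product of per-modality class probabilities followed by renormalization, which stands in for the fusion module used in the paper.

```python
from math import fsum
from typing import Dict, List, Optional, Sequence


def fill_missing(modalities: Dict[str, Optional[Sequence[float]]]) -> Dict[str, List[float]]:
    """Replace each unavailable modality (None) with a proxy vector.

    Assumption for illustration: the proxy is the mean of the available
    modalities' class-probability vectors.
    """
    available = [list(p) for p in modalities.values() if p is not None]
    if not available:
        raise ValueError("at least one modality must be available")
    n_classes = len(available[0])
    proxy = [fsum(p[k] for p in available) / len(available) for k in range(n_classes)]
    return {
        name: (list(p) if p is not None else list(proxy))
        for name, p in modalities.items()
    }


def multiplicative_fuse(modalities: Dict[str, Sequence[float]]) -> List[float]:
    """Multiply class probabilities across modalities, then renormalize to sum to 1."""
    n_classes = len(next(iter(modalities.values())))
    fused = [1.0] * n_classes
    for probs in modalities.values():
        for k in range(n_classes):
            fused[k] *= probs[k]
    total = fsum(fused)
    if total == 0.0:
        raise ValueError("modalities assign zero joint probability to every class")
    return [v / total for v in fused]


# Hypothetical three-class scores; the text modality is unavailable.
scores = {"face": [0.7, 0.2, 0.1], "text": None, "speech": [0.5, 0.3, 0.2]}
fused = multiplicative_fuse(fill_missing(scores))
```

Because the product rewards classes on which all modalities agree, the fused distribution is sharper than any single input. This also means a single overconfident, noisy modality can dominate, which is why a check on input quality before fusion matters.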

Deep Emotion Recognition in Dynamic Data using Facial, Speech and Textual Cues: A Survey

Zhang

Tan

2021

Preprint

With the development of social media and human-computer interaction, video has become one of the most common data formats. As a research hotspot, emotion recognition system is essential to serve people by perceiving people’s emotional state in videos. In recent years, a large number of studies focus on tackling the issue of emotion recognition based on three most common modalities in videos, that is, face, speech and text. The focus of this paper is to sort out the relevant studies of emotion recognition using facial, speech and textual cues due to the lack of review papers concentrating on the three modalities. On the other hand, because of the effective leverage of deep learning techniques to learn latent representation for emotion recognition, this paper focuses on the emotion recognition method based on deep learning techniques. In this paper, we firstly introduce widely accepted emotion models for the purpose of interpreting the definition of emotion. Then we introduce the state-of-the-art for emotion recognition based on unimodality including facial expression recognition, speech emotion recognition and textual emotion recognition. For multimodal emotion recognition, we summarize the feature-level and decision-level fusion methods in detail. In addition, the description of relevant benchmark datasets, the definition of metrics and the performance of the state-of-the-art in recent years are also outlined for the convenience of readers to find out the current research progress. Ultimately, we explore some potential research challenges and opportunities to give researchers reference for the enrichment of emotion recognition-related researches.

“…However, in general, most multi-modal fusion techniques require for the testing phase the simultaneous presence of all the modalities that were used during the model training phase [1]. This requirement becomes a severe limitation in case one or more sensors are missing or their signals are severely corrupted by noise during testing, unless such situations are explicitly handled by the modelling framework [8]. Thus, it would be desirable to improve the testing performance of individual modalities using other modalities during training [3][9] [10].…”

Section: Introduction (mentioning)

confidence: 99%

Robust Latent Representations Via Cross-Modal Translation and Alignment

Rajan

Brutti

Cavallaro

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Multi-modal learning relates information across observation modalities of the same physical phenomenon to leverage complementary information. Most multi-modal machine learning methods require that all the modalities used for training are also available for testing. This is a limitation when signals from some modalities are unavailable or severely degraded. To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only. The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment to improve the representations of a worse performing (or weaker) modality. The translation from the weaker to the better performing (or stronger) modality generates a multi-modal intermediate encoding that is representative of both modalities. This encoding is then correlated with the stronger modality representation in a shared latent space. We validate the proposed framework on the AVEC 2016 dataset (RECOLA) for continuous emotion recognition and show the effectiveness of the framework that achieves state-of-the-art (uni-modal) performance for weaker modalities.
