ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747397

MM-DFN: Multimodal Dynamic Fusion Network for Emotion Recognition in Conversations

Cited by 54 publications (24 citation statements)
References 21 publications

“…MMGCN [13] constructs a fully connected graph to model multimodal and long-distance contextual information, and speaker embeddings are added for encoding speaker information. MM-DFN [17] designs a graph-based dynamic fusion module to reduce redundancy and enhance complementarity between modalities. MMTr [24] preserves the integrity of main modal representations and enhances weak modal representations by using multi-head attention.…”
Section: A. Emotion Recognition in Conversations (confidence: 99%)
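
The graph-based designs mentioned in this statement can be made concrete with a short sketch. Below is a minimal, MMGCN-style single graph-convolution step over a fully connected conversation graph, with learned speaker embeddings added to the utterance features; the dimensions, names, and single-layer design are illustrative assumptions, not the cited papers' actual implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullyConnectedGraphLayer(nn.Module):
    """One graph-convolution step over a fully connected conversation graph.

    Illustrative sketch only: node features are utterance embeddings from
    one modality, speaker identity is injected via a learned embedding,
    and messages are averaged over all utterances in the dialogue.
    """
    def __init__(self, feat_dim: int, num_speakers: int):
        super().__init__()
        self.speaker_emb = nn.Embedding(num_speakers, feat_dim)
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, x: torch.Tensor, speakers: torch.Tensor) -> torch.Tensor:
        # x: (num_utterances, feat_dim); speakers: (num_utterances,)
        h = x + self.speaker_emb(speakers)          # encode speaker information
        # Fully connected graph: every utterance node receives messages
        # from every other node, capturing long-distance context.
        adj = torch.ones(h.size(0), h.size(0))
        adj = adj / adj.sum(dim=1, keepdim=True)    # row-normalize the adjacency
        return F.relu(self.proj(adj @ h))           # aggregate, then transform

# Toy usage: a dialogue of 5 utterances, 128-d features, 2 speakers.
layer = FullyConnectedGraphLayer(feat_dim=128, num_speakers=2)
utts = torch.randn(5, 128)
spk = torch.tensor([0, 1, 0, 1, 0])
out = layer(utts, spk)
print(out.shape)  # torch.Size([5, 128])
```
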
“…MM-DFN [17]: It designs a graph-based dynamic fusion module to fuse multimodal context features, and this module can reduce redundancy and enhance complementarity between modalities.…”
Section: Baselines (confidence: 99%)
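
To give a rough feel for how fusion can reduce cross-modal redundancy, here is a minimal gated-fusion sketch; the sigmoid gate is a generic stand-in chosen for illustration, not MM-DFN's actual dynamic fusion module.

```python
import torch
import torch.nn as nn

class GatedModalityFusion(nn.Module):
    """Gated fusion of two modality streams (generic stand-in, not MM-DFN).

    A sigmoid gate decides, per feature dimension, how much of each modality
    to keep, suppressing redundant information shared by both streams while
    letting complementary features through.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, a: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # a, t: (batch, dim) acoustic and textual utterance features.
        g = torch.sigmoid(self.gate(torch.cat([a, t], dim=-1)))
        fused = g * a + (1.0 - g) * t   # per-dimension convex combination
        return self.out(fused)

# Toy usage with a batch of 4 utterances.
fusion = GatedModalityFusion(dim=128)
acoustic = torch.randn(4, 128)
textual = torch.randn(4, 128)
print(fusion(acoustic, textual).shape)  # torch.Size([4, 128])
```
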
“…It can be observed that SpeechFormer++ with HuBERT features noticeably outperforms the previous works by a large margin of +3.1% WF1 and +1.2% WA. When compared under the hand-crafted features, SpeechFormer++ outperforms ConGCN [78], MMFA-RNN [19], MM-DFN [18] and CTNet [37] in terms of WF1. Note that SpeechFormer++ is simply applied in MELD and does not utilize the context and speaker information.…”
Section: B. Speech Emotion Recognition on MELD (confidence: 99%)
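
For context, frame-level HuBERT features of the kind fed to SpeechFormer++ can be extracted with the Hugging Face transformers library; the checkpoint name and the dummy input below are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoFeatureExtractor, HubertModel

# Illustrative checkpoint; the paper's exact HuBERT variant may differ.
ckpt = "facebook/hubert-base-ls960"
extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = HubertModel.from_pretrained(ckpt).eval()

# One second of dummy 16 kHz audio standing in for an utterance.
waveform = torch.randn(16000)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, num_frames, 768)

print(hidden.shape)  # roughly 50 frames per second of audio
```
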
“…Recently, deep learning methods have delivered superior performance for PSP tasks owing to their remarkable modeling capabilities. For example, convolutional neural networks (CNNs) [10]-[16], graph neural networks (GNNs) [17], [18], recurrent neural networks (RNNs) [19]-[21], and two popular variants of RNNs, long short-term memory (LSTM) [22]-[24] and gated recurrent units (GRUs) [25], have achieved promising results in the PSP domain.…”
Section: Introduction (confidence: 99%)