CTNet: Conversational Transformer Network for Emotion Recognition

Lian, Zheng; Liu, Bin; Tao, Jianhua

doi:10.1109/taslp.2021.3049898

Cited by 142 publications

(79 citation statements)

References 49 publications

“…Table 3 compares the performance of the proposed model with the existing studies that also implemented the multimodal architecture and tested it on the MELD. Most of the previous studies [36, 39] only considered the audio and text modalities. However, the study by Siriwardhana et al. [38] proposed a multimodal fusion model for combining the modality of audio, face, and text and achieved state-of-the-art results.…”

Section: Results (mentioning)

confidence: 99%

“…The transformer is a network architecture that purely depends on the attention mechanism without any recurrent structure [35]. The latest studies focused on using attention mechanisms to fuse different modalities of features for MMER [36, 37, 38, 39]. Ho et al. [36] proposed a multimodal approach based on a multilevel multi-head fusion attention mechanism and RNN to combine audio and text modalities for emotion estimation.…”

Section: Related Studies (mentioning)

confidence: 99%

“…The previous study investigated the use of the crossmodal transformer to reinforce a target modality by introducing the features from another modality, which also learns the attention across these two modalities’ features [40]. One recent study [39] proposed a multimodal learning framework based on the crossmodal transformer target for conversational emotion recognition, combining word-level features and segment-level acoustic features as the inputs. The results demonstrated the effectiveness of the proposed transformer fusion method.…”

Section: Related Studies (mentioning)

confidence: 99%
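The crossmodal attention idea described in the statement above can be illustrated with a minimal sketch. It is not the implementation of any cited paper: it assumes a single attention head, omits the learned query/key/value projections that a real transformer layer would have, and uses made-up two-dimensional "text" and "audio" features. The function name `crossmodal_attention` is hypothetical. Queries come from the target modality, while keys and values come from the other modality.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(target, source):
    """Scaled dot-product attention from a target modality to a source modality.

    target: list of T feature vectors (queries), each of dimension d
    source: list of S feature vectors (keys and values), each of dimension d
    Returns (outputs, weights): T fused vectors and a T x S weight matrix.
    Assumption: no learned projections, single head (illustration only).
    """
    d = len(target[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for query in target:
        # Similarity of this target step to every source step.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in source]
        w = softmax(scores)
        # Weighted sum of source vectors: source information routed to the target step.
        fused = [sum(w[j] * source[j][i] for j in range(len(source))) for i in range(d)]
        outputs.append(fused)
        weights.append(w)
    return outputs, weights


# Toy features (hypothetical values): 2 word-level steps, 3 acoustic segments.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = crossmodal_attention(text, audio)
```

Each row of `attn` is a probability distribution over the acoustic segments, so the two sequences may have different lengths. This is one reason the approach suits word-level text features paired with segment-level acoustic features.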

“…Even though the effectiveness of combining two different modalities by using the attention mechanism has been widely studied, the challenge emerges when the need arises to combine three or more modalities due to the structure of multi-head attention. For this reason, most previous studies on MMER based on the attention mechanism proposed and tested the network architecture for only two modalities [36, 37, 39]. The study [38] deployed models for combining three modalities with transformer-based fusion, but a simple feature concatenation was added at the end to combine different modalities’ features.…”

Section: Related Studies (mentioning)

confidence: 99%

Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion

Xie, Sidulova, Park. Sensors, 2021.

Decades of scientific research have been conducted on developing and evaluating methods for automated emotion recognition. With exponentially growing technology, there is a wide range of emerging applications that require emotional state recognition of the user. This paper investigates a robust approach for multimodal emotion recognition during a conversation. Three separate models for audio, video and text modalities are structured and fine-tuned on the MELD. In this paper, a transformer-based crossmodality fusion with the EmbraceNet architecture is employed to estimate the emotion. The proposed multimodal network architecture can achieve up to 65% accuracy, which significantly surpasses any of the unimodal models. We provide multiple evaluation techniques applied to our work to show that our model is robust and can even outperform the state-of-the-art models on the MELD.

“…For categorical emotion recognition, most of the state-of-the-arts utilize Accuracy (or called Recall) [96] and F1 score to evaluate the performance of models. Here we suppose there are C emotion classes in a dataset.…”

Section: Categorical (mentioning)

confidence: 99%
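As a rough illustration of the metrics named in the statement above, the following sketch computes overall accuracy, per-class recall, and macro-averaged F1 for C emotion classes. It assumes the standard textbook definitions, which may differ in detail from the survey's own formulas. The function name `classification_metrics` and the label values are hypothetical.

```python
def classification_metrics(y_true, y_pred, num_classes):
    """Return (accuracy, per-class recall, macro F1) for integer labels 0..C-1."""
    tp = [0] * num_classes  # true positives per class
    fp = [0] * num_classes  # false positives per class
    fn = [0] * num_classes  # false negatives per class
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    accuracy = sum(tp) / len(y_true)

    recalls, f1s = [], []
    for c in range(num_classes):
        # Guard against classes with no predictions or no samples.
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        f1s.append(f1)

    macro_f1 = sum(f1s) / num_classes
    return accuracy, recalls, macro_f1


# Toy labels for C = 3 emotion classes (made-up data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, recalls, macro_f1 = classification_metrics(y_true, y_pred, 3)
```

On this toy data the overall accuracy is 4/6, while the mean of the per-class recalls is also 4/6 only because the classes are balanced. On imbalanced emotion datasets the two generally diverge, which is why papers often report both alongside F1.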

Deep Emotion Recognition using Facial, Speech and Textual Cues: A Survey

Zhang, Tan. Preprint, 2021.

With the development of social media and human-computer interaction, video has become one of the most common data formats. As a research hotspot, emotion recognition system is essential to serve people by perceiving people’s emotional state in videos. In recent years, a large number of studies focus on tackling the issue of emotion recognition based on three most common modalities in videos, that is, face, speech and text. The focus of this paper is to sort out the relevant studies of emotion recognition using facial, speech and textual cues due to the lack of review papers concentrating on the three modalities. On the other hand, because of the effective leverage of deep learning techniques to learn latent representation for emotion recognition, this paper focuses on the emotion recognition method based on deep learning techniques. In this paper, we firstly introduce widely accepted emotion models for the purpose of interpreting the definition of emotion. Then we introduce the state-of-the-art for emotion recognition based on unimodality including facial expression recognition, speech emotion recognition and textual emotion recognition. For multimodal emotion recognition, we summarize the feature-level and decision-level fusion methods in detail. In addition, the description of relevant benchmark datasets, the definition of metrics and the performance of the state-of-the-art in recent years are also outlined for the convenience of readers to find out the current research progress. Ultimately, we explore some potential research challenges and opportunities to give researchers reference for the enrichment of emotion recognition-related researches.