Interspeech 2012
DOI: 10.21437/interspeech.2012-118

Emotion recognition using acoustic and lexical features

Cited by 35 publications (9 citation statements)
References 0 publications
“…Early fusion is the fusion approach used in the pre-extraction phase of the data. Rozgic et al [18] used early fusion to connect multimodal representations as input to an inference model, which provides a novel idea for modal fusion. Zadeh et al [19] designed a memory fusion network (MFN) using multiview sequential learning, which explicitly illustrates two interactions in the neural architecture.…”
Section: Related Work
Mentioning confidence: 99%
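The early-fusion approach attributed to Rozgic et al above can be made concrete with a minimal sketch. The PyTorch module below is a hypothetical illustration, not the cited authors' implementation: the class name, feature dimensions, and layer sizes are all assumptions. It shows the core idea, acoustic and lexical representations concatenated before a single joint inference network.

```python
# Minimal early-fusion sketch (hypothetical; dimensions and layer sizes
# are illustrative assumptions, not taken from the cited papers).
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Early fusion: modality features are merged before any modeling."""

    def __init__(self, acoustic_dim=88, lexical_dim=300, num_emotions=4):
        super().__init__()
        # A single joint network receives the concatenated representation.
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim + lexical_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_emotions),
        )

    def forward(self, acoustic, lexical):
        # Early fusion: concatenate the raw feature representations up front.
        fused = torch.cat([acoustic, lexical], dim=-1)
        return self.net(fused)

model = EarlyFusionClassifier()
logits = model(torch.randn(8, 88), torch.randn(8, 300))  # batch of 8 utterances
```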
“…Due to their ubiquity, most works on multimodal emotion recognition have focused on combining audio and video [56,66], but how to combine them remains an open question. In early fusion, inputs or raw feature representations are merged before they are fed into a joint network [17,57]. In model-level fusion, each modality is processed by a dedicated network before both intermediate feature representations are merged and then passed through a joint network [16,54].…”
Section: Multimodal Emotion Recognition
Mentioning confidence: 99%
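The model-level fusion described in the quote above can be sketched in the same style as the earlier early-fusion example. In this hypothetical contrast (again, all names and sizes are assumptions), each modality first passes through its own dedicated encoder, and only the intermediate representations are merged before the joint network.

```python
# Minimal model-level fusion sketch (hypothetical; encoder widths and
# input dimensions are illustrative assumptions).
import torch
import torch.nn as nn

class ModelLevelFusionClassifier(nn.Module):
    """Model-level fusion: per-modality encoders, then a joint head."""

    def __init__(self, audio_dim=88, video_dim=512, hidden=64, num_emotions=4):
        super().__init__()
        # Dedicated sub-network per modality.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        # Joint network over the merged intermediate representations.
        self.joint = nn.Linear(2 * hidden, num_emotions)

    def forward(self, audio, video):
        # Fusion happens at the intermediate-feature level, not on raw inputs.
        merged = torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=-1)
        return self.joint(merged)

model = ModelLevelFusionClassifier()
logits = model(torch.randn(8, 88), torch.randn(8, 512))
```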
“…For modality aggregation, Viktor et al [36] use early fusion to concatenate multi-modal features as the input to the inference models. However, this ignores the mismatch between different modalities.…”
Section: Related Work 2.1 Multi-modal Emotion Recognition
Mentioning confidence: 99%
“…Such rich information from multiple modalities could be used to understand the emotional state [29]. Previous research has shown that different modalities are complementary for emotion recognition [23,36]. Each modality carries emotion-relevant information, and how to effectively combine multiple modalities has been an active research focus.…”
Section: Introduction
Mentioning confidence: 99%