2022
DOI: 10.1109/tcsvt.2022.3197420

Video-Based Cross-Modal Auxiliary Network for Multimodal Sentiment Analysis

Abstract: Multimodal sentiment analysis has a wide range of applications due to the information complementarity of multimodal interactions. Previous works focus on investigating efficient joint representations, but they rarely consider insufficient unimodal feature extraction and the data redundancy of multimodal fusion. In this paper, a Video-based Cross-modal Auxiliary Network (VCAN) is proposed, which comprises an audio features map module and a cross-modal selection module. The first module is designed t…
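The abstract is truncated before it describes either module, so only the component names are known. Purely as an illustration of what an audio-features-map plus cross-modal-selection pipeline could look like, here is a minimal PyTorch sketch; every class, layer choice, and dimension below is an assumption made for this note, not the authors' VCAN implementation.

import torch
import torch.nn as nn

class AudioFeaturesMap(nn.Module):
    # Hypothetical: a small convolutional encoder over a spectrogram-like
    # input (B, 1, freq, time), producing a fixed-size audio summary.
    def __init__(self, out_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.proj = nn.Linear(32 * 8 * 8, out_dim)

    def forward(self, x):
        return self.proj(self.encoder(x).flatten(1))  # (B, out_dim)

class CrossModalSelection(nn.Module):
    # Hypothetical: score each visual frame against the audio summary,
    # keep the top-k frames, and average them into one fused vector.
    def __init__(self, dim=128, k=4):
        super().__init__()
        self.k = k
        self.score = nn.Bilinear(dim, dim, 1)

    def forward(self, video, audio):  # video: (B, T, dim), audio: (B, dim)
        B, T, D = video.shape
        a = audio.unsqueeze(1).expand(B, T, D)
        s = self.score(video.reshape(-1, D), a.reshape(-1, D)).view(B, T)
        idx = s.topk(self.k, dim=1).indices  # requires T >= k
        picked = torch.gather(video, 1, idx.unsqueeze(-1).expand(B, self.k, D))
        return picked.mean(dim=1)  # (B, dim)

# Toy usage: 2 clips, 6 visual frames each, 128-d features.
afm, sel = AudioFeaturesMap(), CrossModalSelection()
fused = sel(torch.randn(2, 6, 128), afm(torch.randn(2, 1, 64, 100)))  # (2, 128)

In this hypothetical reading, the selection step keeps only the visual frames most relevant to the audio, one plausible interpretation of reducing the "data redundancy of multimodal fusion" mentioned in the abstract.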

Cited by 11 publications (7 citation statements). References 56 publications (70 reference statements).
“…For unimodal results, Table III shows that the visual modality outperforms the audio modality on benchmark datasets, especially RAVDESS. This verifies the importance of the visual modality for emotion recognition, which is consistent with previous works (Chen et al., 2022b; Praveen et al., 2023). In addition, the reasons why the visual modality is remarkably important on the RAVDESS dataset are speculated as follows.…”
Section: Results (supporting)
Confidence: 92%
“…Therefore, we speculate that the great improvements on classification tasks are due to the combination of MAIIM, MACIM, and attention-based fusion in our proposed KE-AFN. For the regression task, KE-AFN outperforms the existing state-of-the-art method (Chen et al., 2022b) by 1 per cent on MAE and achieves performance on Corr on par with the state of the art, which indicates that KE-AFN works well in fitting specific sentiment scores.…”
Section: Results (mentioning)
Confidence: 79%
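MAE and Corr in this statement are the standard regression metrics on sentiment benchmarks such as CMU-MOSI/MOSEI: mean absolute error between predicted and annotated sentiment scores, and the Pearson correlation between them. A minimal NumPy sketch of how they are typically computed (illustrative only, not the evaluation code of either cited paper):

import numpy as np

def mae(pred, target):
    # Mean absolute error: lower is better.
    return float(np.mean(np.abs(pred - target)))

def corr(pred, target):
    # Pearson correlation with the human labels: higher is better.
    return float(np.corrcoef(pred, target)[0, 1])

y_hat = np.array([0.8, -1.2, 2.1, 0.0])   # predicted sentiment scores
y_true = np.array([1.0, -1.5, 1.8, 0.3])  # annotated sentiment scores
print(f"MAE={mae(y_hat, y_true):.3f}  Corr={corr(y_hat, y_true):.3f}")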
“…The development of augmented reality (AR) technologies and their application in interactive art [1] opens new opportunities for the personalization of visual content. Personalization becomes the basis for creating a deeper and more meaningful experience for users, allowing art to adapt to individual preferences and emotional states [2]. However, improving the sense of immersion during user interaction with interactive art in augmented reality systems often remains a challenge.…”
Section: Introduction (mentioning)
Confidence: 99%