Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition

Yang, Dingkang; Huang, Shuai; Liu, Yang; Zhang, Lihua

doi:10.1109/lsp.2022.3210836

Cited by 32 publications

(4 citation statements)

References 27 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…They used deep-learning models such as InceptionV3, VGG16, and VGG19 to achieve a maximum accuracy of 93.33% [ 17 ]. In [ 18 ], the authors presented a contextual cross-modal transformer module for the fusion of textual and audio modalities operated on IEMOCAP and MELD datasets to achieve a maximum accuracy of 84.27%. In [ 19 ], the authors illustrated a speech recognition technique on frequency domain features of an Arabic dataset using SVM, KNN, and MLP techniques to achieve a maximum recognition accuracy of 77.14%.…”

Section: Related Workmentioning

confidence: 99%

Design and Development of a Non-Contact ECG-Based Human Emotion Recognition System Using SVM and RF Classifiers

2023

View full text Add to dashboard Cite

Emotion recognition becomes an important aspect in the development of human-machine interaction (HMI) systems. Positive emotions impact our lives positively, whereas negative emotions may cause a reduction in productivity. Emotionally intelligent systems such as chatbots and artificially intelligent assistant modules help make our daily life routines effortless. Moreover, a system which is capable of assessing the human emotional state would be very helpful to assess the mental state of a person. Hence, preventive care could be offered before it becomes a mental illness or slides into a state of depression. Researchers have always been curious to find out if a machine could assess human emotions precisely. In this work, a unimodal emotion classifier system in which one of the physiological signals, an electrocardiogram (ECG) signal, has been used is proposed to classify human emotions. The ECG signal was acquired using a capacitive sensor-based non-contact ECG belt system. The machine-learning-based classifiers developed in this work are SVM and random forest with 10-fold cross-validation on three different sets of ECG data acquired for 45 subjects (15 subjects in each age group). The minimum classification accuracies achieved with SVM and RF emotion classifier models are 86.6% and 98.2%, respectively.

show abstract

Section: Related Workmentioning

confidence: 99%

Design and Development of a Non-Contact ECG-Based Human Emotion Recognition System Using SVM and RF Classifiers

2023

View full text Add to dashboard Cite

show abstract

“…Yang et al incorporate context data into the current speech by embedding prior statements between interlocutors, which improves the emotional depiction of the present utterance. The suggested cross-modal converter module then focuses on the interconnections between text and auditory modalities, adaptively fostering modality fusion (Yang et al, 2022 ). Based on the proposed papers listed above, it is clear that multimodality currently plays a significant role in HRI research.…”

Section: Recent Advancements Of Application For Multi-modal Human–rob...mentioning

confidence: 99%

Recent advancements in multimodal human–robot interaction

Yang

et al. 2023

Front. Neurorobot.

View full text Add to dashboard Cite

Robotics have advanced significantly over the years, and human–robot interaction (HRI) is now playing an important role in delivering the best user experience, cutting down on laborious tasks, and raising public acceptance of robots. New HRI approaches are necessary to promote the evolution of robots, with a more natural and flexible interaction manner clearly the most crucial. As a newly emerging approach to HRI, multimodal HRI is a method for individuals to communicate with a robot using various modalities, including voice, image, text, eye movement, and touch, as well as bio-signals like EEG and ECG. It is a broad field closely related to cognitive science, ergonomics, multimedia technology, and virtual reality, with numerous applications springing up each year. However, little research has been done to summarize the current development and future trend of HRI. To this end, this paper systematically reviews the state of the art of multimodal HRI on its applications by summing up the latest research articles relevant to this field. Moreover, the research development in terms of the input signal and the output signal is also covered in this manuscript.

show abstract

“…Previous studies have shown that more effective and valuable joint multimodal representations can be obtained by combining complementary features in different modalities (Shraga et al 2020;Springstein, Müller-Budack, and Ewerth 2021), benefiting from the evolution of learning-based techniques (Yang et al 2023c;Chen et al 2024;Li, Yang, and Zhang 2023;Yang et al 2023d). Most MSA works (Hazarika, Zimmermann, and Poria 2020;Yu et al 2021;Yang et al 2022aYang et al ,d, 2023bYang et al , 2022bLi, Wang, and Cui 2023) are based on the assumptions that all modalities are available during the training and testing phases. In real applications, the assumption will not hold due to many inevitable factors, such as privacy, device, or security constraints, resulting in significant degradation of model performance.…”

Section: Introductionmentioning

confidence: 99%

A Unified Self-Distillation Framework for Multimodal Sentiment Analysis with Uncertain Missing Modalities

Li,

Yang,

Lei

et al. 2024

AAAI

View full text Add to dashboard Cite

Multimodal Sentiment Analysis (MSA) has attracted widespread research attention recently. Most MSA studies are based on the assumption of modality completeness. However, many inevitable factors in real-world scenarios lead to uncertain missing modalities, which invalidate the fixed multimodal fusion approaches. To this end, we propose a Unified multimodal Missing modality self-Distillation Framework (UMDF) to handle the problem of uncertain missing modalities in MSA. Specifically, a unified self-distillation mechanism in UMDF drives a single network to automatically learn robust inherent representations from the consistent distribution of multimodal data. Moreover, we present a multi-grained crossmodal interaction module to deeply mine the complementary semantics among modalities through coarse- and fine-grained crossmodal attention. Eventually, a dynamic feature integration module is introduced to enhance the beneficial semantics in incomplete modalities while filtering the redundant information therein to obtain a refined and robust multimodal representation. Comprehensive experiments on three datasets demonstrate that our framework significantly improves MSA performance under both uncertain missing-modality and complete-modality testing conditions.

show abstract

Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition

Cited by 32 publications

References 27 publications

Design and Development of a Non-Contact ECG-Based Human Emotion Recognition System Using SVM and RF Classifiers

Design and Development of a Non-Contact ECG-Based Human Emotion Recognition System Using SVM and RF Classifiers

Recent advancements in multimodal human–robot interaction

A Unified Self-Distillation Framework for Multimodal Sentiment Analysis with Uncertain Missing Modalities

Contact Info

Product

Resources

About