Interspeech 2021
DOI: 10.21437/interspeech.2021-809
Audio-Visual Speech Emotion Recognition by Disentangling Emotion and Identity Attributes

Cited by 2 publications (1 citation statement)
References 23 publications
“…The CMU MOSEI dataset is one of the most popular datasets used for multimodal emotion recognition. It has been heavily referenced, with 129 citations on Scopus, of which 35 papers 25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59 directly utilize CMU MOSEI features in their application; most of these employ deep-learning architectures for their analysis. This existing research on the CMU MOSEI dataset, however, does not explore the explainability of the CMU MOSEI features.…”
Section: Problem With Pre-extracted Features
Confidence: 99%