Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1034
Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis

Abstract: Related tasks are often inter-dependent and perform better when solved in a joint framework. In this paper, we present a deep multi-task learning framework that jointly performs both sentiment and emotion analysis. The multi-modal inputs (i.e., text, acoustic and visual frames) of a video convey diverse and distinctive information, and usually do not contribute equally to the decision making. We propose a context-level inter-modal attention framework for simultaneously predicting the sentiment and expressed emotions of an utterance.
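The abstract describes the architecture only at a high level. Below is a minimal, illustrative PyTorch-style sketch of what a context-level inter-modal attention block with shared multi-task heads could look like; the module names, dimensions, attention formulation, and head design are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: context-level inter-modal attention feeding two
# task-specific heads (sentiment + emotion). Shapes and fusion details are
# illustrative assumptions, not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterModalAttentionMTL(nn.Module):
    def __init__(self, dim=128, n_emotions=6):
        super().__init__()
        # per-modality encoders over the utterances of a video (the "context" level)
        self.encoders = nn.ModuleDict({
            m: nn.GRU(dim, dim, batch_first=True)
            for m in ("text", "acoustic", "visual")
        })
        # shared fused representation feeds two task-specific heads
        self.sentiment_head = nn.Linear(3 * dim, 1)          # sentiment score
        self.emotion_head = nn.Linear(3 * dim, n_emotions)   # multi-label emotions

    def cross_attend(self, query, key_value):
        # scaled dot-product attention of one modality over another,
        # computed across all utterances of the video
        scores = query @ key_value.transpose(1, 2) / key_value.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ key_value

    def forward(self, text, acoustic, visual):
        # each input: (batch, n_utterances, dim) features for one video
        feats = {m: self.encoders[m](x)[0]
                 for m, x in zip(("text", "acoustic", "visual"),
                                 (text, acoustic, visual))}
        # every modality attends over the other two; the residual keeps its own signal
        fused = []
        for m in feats:
            others = [self.cross_attend(feats[m], feats[o]) for o in feats if o != m]
            fused.append(feats[m] + sum(others))
        shared = torch.cat(fused, dim=-1)                     # (batch, n_utt, 3*dim)
        return self.sentiment_head(shared), self.emotion_head(shared)
```

In such a setup, training would backpropagate a joint objective (e.g., a weighted sum of a sentiment loss and a multi-label emotion loss) through the shared encoders, which is what lets the two related tasks benefit from each other.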

Citations: cited by 136 publications (44 citation statements)
References: 22 publications
“…Third, this study demonstrates the superiority of the proposed AV-TFN method through comparisons of the performances of the visual network (VN) [7], audio network (AN), and audio-visual network (AVN) with concatenation (AVN-Concat) [8] and attention (AVN-Atten) [9] techniques. The experiment results show that AV-TFN significantly improves F1 score compared with AN, VN, AVN-Concat, and AVN-Atten methods, while also achieving speeds similar to that of the fast VN method.…”
mentioning
confidence: 84%
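The statement above contrasts concatenation-based and attention-based fusion of audio and visual features. A hedged sketch of the difference is given below; the class names, shapes, and scoring scheme are illustrative assumptions, not the cited AVN-Concat / AVN-Atten systems.

```python
# Illustrative contrast between concatenation fusion and attention-weighted
# fusion of audio/visual feature vectors.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Stacks the modality vectors and lets a linear layer mix them."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.clf = nn.Linear(2 * dim, n_classes)

    def forward(self, audio, visual):          # each: (batch, dim)
        return self.clf(torch.cat([audio, visual], dim=-1))

class AttentionFusion(nn.Module):
    """Learns per-example weights so the more informative modality dominates."""
    def __init__(self, dim, n_classes):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)        # one relevance score per modality
        self.clf = nn.Linear(dim, n_classes)

    def forward(self, audio, visual):          # each: (batch, dim)
        stacked = torch.stack([audio, visual], dim=1)          # (batch, 2, dim)
        weights = torch.softmax(self.scorer(stacked), dim=1)   # (batch, 2, 1)
        fused = (weights * stacked).sum(dim=1)                 # (batch, dim)
        return self.clf(fused)
```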
“…On the other hand Pham et al (2018) introduced multi-modal sequence-to-sequence models which perform specially well in bi-modal settings. Finally, Akhtar et al (2019) proposed a multi-modal, multi-task approach in which the inputs from a video (text, acoustic and visual frames), are exploited for simultaneously predicting the sentiment and expressed emotions of an utterance. Our work is related to all of these approaches, but it is different in that we apply multi-modal techniques not only for sentiment classification, but also for aspect extraction.…”
Section: Related Work
mentioning
confidence: 99%
“…various state-of-the-art systems for both sentiment and emotion analysis. Very recently, Akhtar et al. (2019) introduced an attention based multi-task learning framework for sentiment and emotion classification on the CMU-MOSEI dataset.…”
Section: Related Work
mentioning
confidence: 99%