Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018
DOI: 10.18653/v1/p18-1207

Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment

Abstract: Multimodal affective computing, learning to recognize and interpret human affect and subjective information from multiple data sources, is still challenging because: (i) it is hard to extract informative features representing human affect from heterogeneous inputs; (ii) current fusion strategies only fuse different modalities at abstract levels, ignoring time-dependent interactions between modalities. Addressing these issues, we introduce a hierarchical multimodal architecture with attention and word-level fusion…
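As a concrete illustration of what word-level fusion means here, the following is a minimal sketch, assuming acoustic frames pre-aligned and averaged per word so both streams share the word time axis. It illustrates the general technique the abstract describes, not the authors' released model; the module names, feature dimensions (300-d word vectors, 74-d acoustic frames) and class count are assumptions.

```python
# Hedged sketch: word-level fusion of text and audio, followed by an
# attention-weighted utterance representation (PyTorch).
import torch
import torch.nn as nn

class WordLevelFusion(nn.Module):
    def __init__(self, text_dim=300, audio_dim=74, hidden=128, n_classes=4):
        super().__init__()
        # One encoder per modality; inputs are assumed word-aligned.
        self.text_gru = nn.GRU(text_dim, hidden, bidirectional=True,
                               batch_first=True)
        self.audio_gru = nn.GRU(audio_dim, hidden, bidirectional=True,
                                batch_first=True)
        self.attn = nn.Linear(4 * hidden, 1)        # score per fused word
        self.clf = nn.Linear(4 * hidden, n_classes)

    def forward(self, text, audio):
        # text: (batch, words, text_dim); audio: (batch, words, audio_dim)
        h_t, _ = self.text_gru(text)                # (batch, words, 2*hidden)
        h_a, _ = self.audio_gru(audio)              # (batch, words, 2*hidden)
        fused = torch.cat([h_t, h_a], dim=-1)       # word-level fusion
        w = torch.softmax(self.attn(fused), dim=1)  # attention over words
        utt = (w * fused).sum(dim=1)                # utterance representation
        return self.clf(utt)                        # sentiment/emotion logits

model = WordLevelFusion()
logits = model(torch.randn(2, 20, 300), torch.randn(2, 20, 74))  # smoke test
```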


Cited by 121 publications (71 citation statements).
References 30 publications.
“…Following prior practice (Gu et al., 2018), we adopted the same feature extraction scheme for the language, visual and acoustic modalities.…”
Section: Unimodal Feature Representations (mentioning)
confidence: 99%
“…We compare HFFN with the following multimodal algorithms: RMFN (Liang et al., 2018a), MFN (Zadeh et al., 2018a), MCTN (Pham et al., 2019), BC-LSTM (Poria et al., 2017b), TFN, MARN (Zadeh et al., 2018b), LMF, MFM (Tsai et al., 2019), MR-RF (Barezi et al., 2018), FAF (Gu et al., 2018b), RAVEN (Wang et al., 2019), GMFN (Zadeh et al., 2018c), Memn2n (Sukhbaatar et al., 2015), MM-B2, CHFusion (Majumder et al., 2018), SVM Trees (Rozgic et al., 2012), CMN, C-MKL (Poria et al., 2016b) and CAT-LSTM (Poria et al., 2017c).…”
Section: Comparison With Baselines (mentioning)
confidence: 99%
“…The authors use the same feature set as the one described in subsection 3.1. FAF [20]: uses hierarchical attention with bidirectional gated recurrent units at the word level and a fine-tuning attention mechanism on each extracted representation. The extracted feature vector is passed to a CNN, which makes the final decision.…”
Section: Baseline Models (mentioning)
confidence: 99%
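A minimal sketch of the pipeline this description suggests: a bi-GRU with word-level attention per modality, a gating ("fine-tuning") attention over the resulting representations, and a small CNN for the final decision. All names and sizes are illustrative assumptions, not the FAF paper's exact design.

```python
# Hedged sketch of an FAF-style pipeline (PyTorch), under assumed sizes.
import torch
import torch.nn as nn

class AttnEncoder(nn.Module):
    """Bi-GRU encoder with additive attention over word positions."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, x):                        # x: (batch, words, in_dim)
        h, _ = self.gru(x)                       # (batch, words, 2*hidden)
        a = torch.softmax(self.score(h), dim=1)  # word-level attention
        return (a * h).sum(dim=1)                # (batch, 2*hidden)

class FAFSketch(nn.Module):
    def __init__(self, text_dim=300, audio_dim=74, hidden=64, n_classes=4):
        super().__init__()
        self.text_enc = AttnEncoder(text_dim, hidden)
        self.audio_enc = AttnEncoder(audio_dim, hidden)
        # Assumed reading of the "fine-tuning attention": a learned gate
        # re-weighting each modality representation before the CNN.
        self.gate = nn.Linear(2 * hidden, 1)
        self.cnn = nn.Conv1d(2 * hidden, 32, kernel_size=2)
        self.clf = nn.Linear(32, n_classes)

    def forward(self, text, audio):
        reps = torch.stack([self.text_enc(text),
                            self.audio_enc(audio)], dim=2)
        # reps: (batch, 2*hidden, n_modalities)
        g = torch.softmax(self.gate(reps.transpose(1, 2)), dim=1)
        reps = reps * g.transpose(1, 2)               # gated representations
        feat = torch.relu(self.cnn(reps)).squeeze(-1) # (batch, 32)
        return self.clf(feat)                         # final decision

model = FAFSketch()
logits = model(torch.randn(2, 20, 300), torch.randn(2, 20, 74))  # smoke test
```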
“…In [19], a word-level alignment between all modalities is proposed. Following this idea, the authors in [20] use a hierarchical attention architecture. Specifically, they pretrain recurrent networks to perform single-modality sentiment classification.…”
Section: Introduction (mentioning)
confidence: 99%
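The pretraining step this statement refers to could look like the following sketch: train a unimodal recurrent classifier first, then transfer its encoder weights into the multimodal model. The class names, sizes and mean-pooling readout are assumptions for illustration; the training loop is elided.

```python
# Hedged sketch: unimodal pretraining followed by encoder transfer.
import torch
import torch.nn as nn

class UnimodalClassifier(nn.Module):
    def __init__(self, in_dim=300, hidden=64, n_classes=4):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, bidirectional=True, batch_first=True)
        self.clf = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                 # x: (batch, words, in_dim)
        h, _ = self.gru(x)                # (batch, words, 2*hidden)
        return self.clf(h.mean(dim=1))    # mean-pool words, then classify

# Pretrain on unimodal labels (loop elided), then reuse the encoder
# inside the multimodal architecture by copying its weights:
pretrained = UnimodalClassifier()
multimodal_text_gru = nn.GRU(300, 64, bidirectional=True, batch_first=True)
multimodal_text_gru.load_state_dict(pretrained.gru.state_dict())
```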