2019
DOI: 10.1609/aaai.v33i01.33017216

Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors

Abstract: Humans convey their intentions through the usage of both verbal and nonverbal behaviors during face-to-face communication. Speaker intentions often vary dynamically depending on different nonverbal contexts, such as vocal patterns and facial expressions. As a result, when modeling human language, it is essential to not only consider the literal meaning of the words but also the nonverbal contexts in which these words appear. To better model human language, we first model expressive nonverbal representations by…


Citations: cited by 285 publications (195 citation statements)
References: 15 publications (18 reference statements)
“…We compare HFFN with the following multimodal algorithms: RMFN (Liang et al., 2018a), MFN (Zadeh et al., 2018a), MCTN (Pham et al., 2019), BC-LSTM (Poria et al., 2017b), TFN, MARN (Zadeh et al., 2018b), LMF, MFM (Tsai et al., 2019), MR-RF (Barezi et al., 2018), FAF (Gu et al., 2018b), RAVEN (Wang et al., 2019), GMFN (Zadeh et al., 2018c), Memn2n (Sukhbaatar et al., 2015), MM-B2, CHFusion (Majumder et al., 2018), SVM Trees (Rozgic et al., 2012), CMN, C-MKL (Poria et al., 2016b) and CAT-LSTM (Poria et al., 2017c).…”
Section: Comparison With Baselines (mentioning)
confidence: 99%
“…Despite the popularity of this strategy, it has been shown to fail to fully capture cross-modal interactions [14,15]. Consequently, several multimodal feature representation strategies have been proposed for various applications [16,14,15,17]. Our work continues this line of research by investigating multimodal feature representation strategies for spoken words, as evaluated on the task of word importance prediction.…”
Section: Related Work (mentioning)
confidence: 97%
“…Rather than considering these two modalities as independent observations of speech, we focus on their cross-modal interaction to obtain a unified representation. We recognize that non-verbal cues during face-to-face communication help shape how humans understand spoken words [17]. Prosody is one such channel in spoken dialogue that is important in conversational speech, where speakers attach prosodic prominence to words (or sub-word components) to help listeners disambiguate meaning [24,25,26].…”
Section: Related Work (mentioning)
confidence: 99%
“…The experimental results show that the Bi-LSTM framework prevails over the traditional HMM framework. Wang et al. [23] applied the recurrent attended variation embedding network (RAVEN) to multimodal emotion recognition, where an LSTM is used to extract features from each single modality. The multimodal emotion studies above utilized deep neural network models, and the results outperformed the traditional methods.…”
Section: Related Work (mentioning)
confidence: 99%
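
The citation statements above summarize the paper's core idea: nonverbal behaviors (visual and acoustic) are encoded with per-modality LSTMs and then used to "shift" the representation of each spoken word. The sketch below is a minimal, illustrative reading of that description, not the authors' released code; all layer names, feature dimensions (300-d word vectors, 47-d visual, 74-d acoustic), and the simple sigmoid gating are assumptions chosen only to make the idea concrete.

```python
# Minimal sketch (assumed dimensions and layer names, not the authors' code):
# per-modality LSTMs encode nonverbal behavior sequences, gates conditioned on
# the word mix them into a "shift" vector, and the shift is added to the
# original word embedding to produce a context-adjusted representation.
import torch
import torch.nn as nn


class NonverbalShift(nn.Module):
    def __init__(self, word_dim=300, visual_dim=47, acoustic_dim=74, hidden=32):
        super().__init__()
        # Nonverbal sub-networks: one LSTM per nonverbal modality.
        self.visual_lstm = nn.LSTM(visual_dim, hidden, batch_first=True)
        self.acoustic_lstm = nn.LSTM(acoustic_dim, hidden, batch_first=True)
        # Gates conditioned on the word embedding and each nonverbal embedding.
        self.visual_gate = nn.Linear(word_dim + hidden, 1)
        self.acoustic_gate = nn.Linear(word_dim + hidden, 1)
        # Project the gated nonverbal mixture into the word-embedding space.
        self.shift_proj = nn.Linear(2 * hidden, word_dim)

    def forward(self, word_emb, visual_seq, acoustic_seq):
        # word_emb: (batch, word_dim); *_seq: (batch, time, feat_dim)
        _, (h_v, _) = self.visual_lstm(visual_seq)
        _, (h_a, _) = self.acoustic_lstm(acoustic_seq)
        h_v, h_a = h_v[-1], h_a[-1]  # final hidden state per modality
        w_v = torch.sigmoid(self.visual_gate(torch.cat([word_emb, h_v], dim=-1)))
        w_a = torch.sigmoid(self.acoustic_gate(torch.cat([word_emb, h_a], dim=-1)))
        shift = self.shift_proj(torch.cat([w_v * h_v, w_a * h_a], dim=-1))
        # Dynamically adjusted ("shifted") word representation.
        return word_emb + shift


# Usage: shift a batch of 4 word vectors using 20-step nonverbal sequences.
model = NonverbalShift()
shifted = model(torch.randn(4, 300), torch.randn(4, 20, 47), torch.randn(4, 20, 74))
print(shifted.shape)  # torch.Size([4, 300])
```

The additive-shift design keeps the verbal embedding as the anchor and lets the nonverbal context move it only as much as the gates allow, which matches the intuition in the abstract that word meaning should vary with vocal patterns and facial expressions rather than be replaced by them.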