Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder

Gu, Yue; Li, Xinyu; Huang, Kaixiang; Fu, Shiyu; Yang, Kangning; Chen, Shuhong; Zhou, Moliang; Marsic, Ivan

doi:10.1145/3240508.3240714

Cited by 28 publications

(7 citation statements)

References 31 publications


“…Recently, word-level fusion methods have received substantial research attention and been widely acknowledged for effective exploration of time-dependent interactions (Wang et al, 2019; Zadeh et al, 2018a,b,c; Gu et al, 2018a; Rajagopalan et al, 2016). For example, and Gu et al (2018b) leverage word-level alignment between modalities and explore time-restricted cross-modal dynamics.…”

Section: Related Work (mentioning)

confidence: 99%

Divide, Conquer and Combine: Hierarchical Feature Fusion Network with Local and Global Perspectives for Multimodal Affective Computing

Mai, Hu, Xing

2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics


We propose a general strategy named 'divide, conquer and combine' for multimodal fusion. Instead of directly fusing features at the holistic level, we conduct fusion hierarchically so that both local and global interactions are considered for a comprehensive interpretation of multimodal embeddings. In the 'divide' and 'conquer' stages, we conduct local fusion by exploring the interaction of a portion of the aligned feature vectors across various modalities lying within a sliding window, which ensures that each part of the multimodal embeddings is explored sufficiently. On this basis, global fusion is conducted in the 'combine' stage to explore the interconnection across local interactions, via an Attentive Bi-directional Skip-connected LSTM that directly connects distant local interactions and integrates two levels of attention mechanism. In this way, local interactions can exchange information sufficiently and thus obtain an overall view of multimodal information. Our method achieves state-of-the-art performance on multimodal affective computing with higher efficiency.
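The 'divide' and 'conquer' stages described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each modality is a flat list of numbers aligned to a shared index, that the window slides over that index, and that plain concatenation stands in for whatever interaction the paper actually computes inside a window. The names `local_fusion`, `window` and `stride` are hypothetical, and the global 'combine' stage is not shown.

```python
from typing import List, Sequence

Vector = List[float]


def local_fusion(
    modalities: Sequence[Sequence[float]], window: int, stride: int
) -> List[Vector]:
    """Return one fused vector per sliding window over aligned modalities.

    Assumption: concatenation is used as a stand-in fusion operator.
    Trailing positions that do not fill a whole window are dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not modalities:
        raise ValueError("at least one modality is required")
    length = len(modalities[0])
    if any(len(m) != length for m in modalities):
        raise ValueError("modalities must be aligned to the same length")

    fused: List[Vector] = []
    for start in range(0, length - window + 1, stride):
        chunk: Vector = []
        # Take the same slice from every modality and join the pieces.
        for modality in modalities:
            chunk.extend(modality[start:start + window])
        fused.append(chunk)
    return fused


if __name__ == "__main__":
    text = [1.0, 2.0, 3.0, 4.0]       # hypothetical text features
    audio = [10.0, 20.0, 30.0, 40.0]  # hypothetical audio features
    for local in local_fusion([text, audio], window=2, stride=1):
        print(local)
```

With a stride smaller than the window, neighbouring windows overlap, so each position contributes to more than one local interaction. A global stage would then consume the list of local vectors as a sequence.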



“…One of the most widely developed solutions is the approach using Deep learning methods (Gu et al, 2018). The use of neural networks to process and analyse emotions is one of the most popular solutions.…”

Section: Literature Review (mentioning)

confidence: 99%

Emotion Detection Based on Sentiment Analysis: An Example of a Social Robots on Short and Long Texts Conversation

Probierz, Gałuszka

2022

ERSJ


Purpose: The aim of this paper is to present a solution to detect emotions from text obtained in a conversation with a social robot. Emotions will be detected using sentiment analysis based on the English and Polish lexicon. Design/methodology/approach: Data from social robot conversation records will be converted into text and then split into short and long speech. The original language utterances will then be analysed using the Polish lexicon, while the translated texts will be analysed using the English emotional lexicon. Findings: The results obtained indicate the same or similar distribution of emotions made by sentiment analysis using both plNetWord and NRC lexicons. Practical Implications: The results obtained can be used for further research addressing the creation and development of lexicons based on the selected language. They are also applicable to the implementation of solutions for detecting and responding to conversational emotions by social robots. Originality/Value: The analyses so far mostly take up the subject of textual analysis in English. The aim of the present study is to analyse a Polish text and to compare the results obtained with those for English texts. The analysis of differences in the emotional sentiment of utterances may lead to the construction of more effective models based on the chosen language.
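Lexicon-based emotion detection of the kind this abstract describes can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's pipeline: `TOY_LEXICON` is a made-up word-to-emotion mapping whose entries are not taken from NRC or any Polish lexicon, and the helper names are hypothetical. The sketch tokenizes an utterance, looks each word up, and normalizes the counts into a distribution of emotions.

```python
import re
from collections import Counter
from typing import Dict, List

# Toy lexicon for illustration only; NOT entries from any real lexicon.
TOY_LEXICON: Dict[str, List[str]] = {
    "happy": ["joy"],
    "great": ["joy", "trust"],
    "afraid": ["fear"],
    "angry": ["anger"],
}


def emotion_counts(text: str, lexicon: Dict[str, List[str]]) -> Counter:
    """Count emotion labels triggered by the words of one utterance."""
    # Match runs of letters only, so accented characters are kept.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts: Counter = Counter()
    for token in tokens:
        counts.update(lexicon.get(token, []))
    return counts


def emotion_distribution(text: str, lexicon: Dict[str, List[str]]) -> Dict[str, float]:
    """Normalize the counts to shares; empty if no lexicon word occurs."""
    counts = emotion_counts(text, lexicon)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {emotion: n / total for emotion, n in counts.items()}


if __name__ == "__main__":
    print(emotion_distribution("I am happy, this is great!", TOY_LEXICON))
```

Comparing two languages in this setup would mean running the same function once with a lexicon for the original text and once with a lexicon for the translated text, then comparing the two distributions.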


“…Though items can be expressed by multiple ways such as image, video, sound, text and so on, the combined representations of items should require a feature fusion mechanism to ensure that multiple inputs are appropriately integrated. Furthermore, the strategy that synchronizes different inputs of multi-modalities at the same level is an effective way as well (Gu et al, 2018a).…”

Section: Image Embedding With Textual Alignment (mentioning)

confidence: 99%

Dynamic attention-based explainable recommendation with textual and visual fusion

Liu, Zhang, Gulla

2020

Information Processing & Management


Explainable recommendation, which provides explanations about why an item is recommended, has attracted growing attention in both research and industry communities. However, most existing explainable recommendation methods cannot provide multi-modal explanations consisting of both textual and visual modalities or adaptive explanations tailored for the user's dynamic preference, potentially leading to the degradation of customers' satisfaction, confidence and trust for the recommender system. On the technical side, Recurrent Neural Network (RNN) has become the most prevalent technique to model dynamic user preferences. Benefiting from the natural characteristics of RNN, the hidden state is a combination of long-term dependency and short-term interest to some degree. But it works like a black-box and the monotonic temporal dependency of RNN is not sufficient to capture the user's short-term interest. In this paper, to deal with the above issues, we propose a novel Attentive Recurrent Neural Network (Ante-RNN) with textual and visual fusion for the dynamic explainable recommendation. Specifically, our model jointly learns image representations with textual alignment and text representations with topical attention mechanism in a parallel way. Then a novel dynamic contextual attention mechanism is incorporated into Ante-RNN for modelling the complicated correlations among recent items and strengthening the user's short-term interests. By combining the full latent visual-semantic alignments and a hybrid attention mechanism including topical and contextual attentions, Ante-RNN makes the recommendation process more transparent and explainable. Extensive experimental results on two real world datasets demonstrate the superior performance and explainability of our model.
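The idea of attending over recent items can be illustrated with generic dot-product attention. This is a swapped-in textbook technique, not the contextual attention defined for Ante-RNN: the abstract does not give the scoring function, so the dot product, the `query` vector standing for the current user state, and the function names are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: Sequence[float], items: Sequence[Sequence[float]]
) -> Tuple[List[float], List[float]]:
    """Return (weights, context) for dot-product attention over items.

    Assumption: every item vector has the same dimension as the query.
    """
    if not items:
        raise ValueError("at least one item is required")
    # Score each recent item by its similarity to the query.
    scores = [sum(q * x for q, x in zip(query, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    # The context is the attention-weighted average of the item vectors.
    context = [
        sum(w * item[i] for w, item in zip(weights, items)) for i in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    recent_items = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical embeddings
    weights, context = attend([1.0, 0.0], recent_items)
    print(weights, context)
```

The weights sum to one and can be read as how much each recent item contributes, which is the property that makes attention useful for explanation.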
