Proceedings of the 26th ACM International Conference on Multimedia 2018
DOI: 10.1145/3240508.3240714

Human Conversation Analysis Using Attentive Multimodal Networks with Hierarchical Encoder-Decoder

Abstract: Human conversation analysis is challenging because meaning can be expressed through words, intonation, or even body language and facial expression. We introduce a hierarchical encoder-decoder structure with an attention mechanism for conversation analysis. The hierarchical encoder learns word-level features from video, audio, and text data that are then formulated into conversation-level features. The corresponding hierarchical decoder is able to predict different attributes at given time instances. To integra…

Citations: cited by 28 publications (7 citation statements)
References: 31 publications
“…Recently, word-level fusion methods have received substantial research attention and been widely acknowledged for effective exploration of time-dependent interactions (Wang et al, 2019; Zadeh et al, 2018a,b,c; Gu et al, 2018a; Rajagopalan et al, 2016). For example, and Gu et al (2018b) leverage word-level alignment between modalities and explore time-restricted cross-modal dynamics.…”
Section: Related Work (mentioning)
confidence: 99%
“…One of the most widely developed solutions is the approach using Deep learning methods (Gu et al, 2018). The use of neural networks to process and analyse emotions is one of the most popular solutions.…”
Section: Literature Review (mentioning)
confidence: 99%
“…Though items can be expressed by multiple ways such as image, video, sound, text and so on, the combined representations of items should require a feature fusion mechanism to ensure that multiple inputs are appropriately integrated. Furthermore, the strategy that synchronizes different inputs of multi-modalities at the same level is an effective way as well (Gu et al, 2018a).…”
Section: Image Embedding With Textual Alignment (mentioning)
confidence: 99%