2020
DOI: 10.48550/arxiv.2001.06206
Preprint

Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System

Abstract: Understanding dynamic scenes and dialogue contexts in order to converse with users has been challenging for multimodal dialogue systems. The 8th Dialog System Technology Challenge (DSTC8) (Seokhwan Kim 2019) proposed an Audio Visual Scene-Aware Dialog (AVSD) task (Hori et al. 2018), which contains multiple modalities including audio, vision, and language, to evaluate how well dialogue systems understand different modalities and respond to users. In this paper, we propose a multi-step joint-modality attention net…
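The abstract is cut off before the model details, but the named technique can be illustrated in broad strokes. Below is a minimal, assumption-laden sketch of a multi-step joint-modality attention step in PyTorch: a question/context vector repeatedly attends over audio and visual feature sequences and fuses the attended contexts back into its state. The module name, dimensions, number of steps, and fusion rule are illustrative guesses, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointModalityAttention(nn.Module):
    """Illustrative multi-step attention over audio and visual features,
    conditioned on an encoded question/dialogue-context vector.
    Dimensions and the fusion rule are assumptions, not the paper's design."""

    def __init__(self, d_model: int = 512, n_steps: int = 3):
        super().__init__()
        self.n_steps = n_steps
        self.query_proj = nn.Linear(d_model, d_model)
        self.audio_proj = nn.Linear(d_model, d_model)
        self.visual_proj = nn.Linear(d_model, d_model)
        self.fuse = nn.Linear(3 * d_model, d_model)

    def attend(self, query, feats, proj):
        # Scaled dot-product attention of a single query over a feature sequence.
        keys = proj(feats)                                   # (B, T, d)
        scores = torch.einsum("bd,btd->bt", query, keys)     # (B, T)
        weights = F.softmax(scores / keys.size(-1) ** 0.5, dim=-1)
        return torch.einsum("bt,btd->bd", weights, feats)    # (B, d)

    def forward(self, question, audio_feats, visual_feats):
        # question: (B, d); audio_feats, visual_feats: (B, T, d)
        state = self.query_proj(question)
        for _ in range(self.n_steps):
            a_ctx = self.attend(state, audio_feats, self.audio_proj)
            v_ctx = self.attend(state, visual_feats, self.visual_proj)
            # Fuse the attended contexts back into the query state and iterate.
            state = torch.tanh(self.fuse(torch.cat([state, a_ctx, v_ctx], dim=-1)))
        return state  # refined joint-modality representation for the decoder
```

The loop is the "multi-step" part: each pass re-reads the audio and visual streams with a query that already reflects what was attended to in earlier passes.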

Cited by 8 publications (4 citation statements)
References 17 publications
“…Baselines. To demonstrate the effectiveness of our proposed model, we compare with several baseline methods: (i) Baseline [5], (ii) JMAN [19], a multi-step joint-modality attention network, (iii) RLM [6], which uses a pre-trained language model to process multimodal inputs, (iv) SCGA [20], which uses co-reference graph attention to deduce correlations among modalities, and (v) PDC [21], which uses both a semantic graph and a pre-trained language model. For a thorough comparison, we also implement another variant of our model: BART (I3D+VGGish), which directly feeds the offline-extracted video features into BART as video embeddings.…”
Section: Results (mentioning)
confidence: 99%
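The BART (I3D+VGGish) variant mentioned above feeds offline-extracted video features into BART as video embeddings. A plausible way to do that, sketched below with the Hugging Face transformers API, is to project I3D (visual) and VGGish (audio) features to BART's hidden size and pass the concatenated sequence through inputs_embeds; the checkpoint name, feature dimensions, and prepend-to-text layout are assumptions, not the cited implementation.

```python
import torch
import torch.nn as nn
from transformers import BartModel, BartTokenizer

# Hypothetical projection of offline-extracted video features into BART's
# embedding space; the feature dimensions (I3D=2048, VGGish=128) are common
# defaults, not taken from the cited paper.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")
d_model = model.config.d_model

i3d_proj = nn.Linear(2048, d_model)      # visual features -> token-sized embeddings
vggish_proj = nn.Linear(128, d_model)    # audio features -> token-sized embeddings

i3d_feats = torch.randn(1, 16, 2048)     # (batch, video segments, feature dim)
vggish_feats = torch.randn(1, 16, 128)

text = tokenizer("what is the person doing ?", return_tensors="pt")
text_embeds = model.get_input_embeddings()(text.input_ids)   # (1, L, d_model)

# Prepend projected video/audio "tokens" to the text embeddings and feed the
# whole sequence to the encoder via inputs_embeds instead of input_ids.
video_embeds = torch.cat([i3d_proj(i3d_feats), vggish_proj(vggish_feats)], dim=1)
inputs_embeds = torch.cat([video_embeds, text_embeds], dim=1)
attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)

outputs = model(inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                decoder_input_ids=text.input_ids)
print(outputs.last_hidden_state.shape)
```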
“…Li et al. [22] proposed a transformer-based generative framework that integrates all the modalities by encoding their features into the system and generates better multimodal system responses using multi-task learning. Chu et al. [38] described a consecutive multimodal fusion strategy that applies joint-modality attention over the course of the conversation. Although these approaches achieve strong performance, they have two limitations.…”
Section: Related Work (mentioning)
confidence: 99%
“…Early attempts [14,24,33,35] employ recurrent neural networks to encode dialog history. Later methods [6,30,49] use the attention mechanism, [45] designs a memory network to extract the relationships between different modalities, and [22,29,26] employ Transformer-based networks to handle cross-modality learning. More explicit relationships in dialogs have recently been studied, showing promising results [12,23].…”
Section: Related Work (mentioning)
confidence: 99%