2021
DOI: 10.1109/taslp.2021.3065852

End-to-End Recurrent Cross-Modality Attention for Video Dialogue

Cited by 3 publications (4 citation statements)
References: 60 publications
“…To boost performance, transformer-based VGD systems (Li et al, 2021b) are utilized on top of large-scale pre-trained language models (Radford et al, 2019; Raffel et al, 2020). Another immense challenge is keeping track of the extended dialogue context and the video, for which memory networks (Lin et al, 2019; Xie and Iacobacci, 2020) and multi-step attention (Chu et al, 2020) were introduced to efficiently store the video and the long episodic dialogue. Graph representations (Kim et al, 2021; Pham et al, 2022) were also popular solutions for holding semantic commonalities between the dialogue and the video.…”
Section: Video-Grounded Dialogues (mentioning)
confidence: 99%
“…See more scheduling functions in Appendix B.2.

AVSD@DSTC7 results (B1–B4 = BLEU-1–4, M = METEOR, R = ROUGE-L, C = CIDEr):

Method                        B1     B2     B3     B4     M      R      C
EE-DMN (Lin et al, 2019)      0.641  0.493  0.388  0.310  0.241  0.527  0.912
JMAN (Chu et al, 2020)        0.667  0.521  0.413  0.334  0.239  0.533  0.941
CMU (Sanabria et al, 2019)    0.718  0.584  0.478  0.394  0.267  0.563  1.094
COST (Pham et al, 2022)       0.723  0.589  0.483  0.400  0.266  0.561  1.085
MSTN                          –      –      –      0.377  0.275  0.566  1.115
JSTL (Hori et al, 2019b)      0…

…video to be freely utilized for their purposes. Therefore, L_SAL is optimized when the training iteration number is odd, and L_RLE when it is even:…”
Section: Optimization and Inference (mentioning)
confidence: 99%
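The excerpt above describes an alternating schedule in which one objective is optimized on odd training iterations and the other on even ones. A minimal PyTorch-style sketch of that parity-based alternation, assuming hypothetical callables `compute_sal_loss` and `compute_rle_loss` standing in for the citing paper's L_SAL and L_RLE terms:

```python
import torch

def train_alternating(model, optimizer, loader, compute_sal_loss, compute_rle_loss):
    """Alternate two objectives by training-iteration parity:
    L_SAL on odd iterations, L_RLE on even ones.
    `compute_sal_loss` / `compute_rle_loss` are hypothetical stand-ins
    mapping (model, batch) to a scalar loss tensor."""
    for step, batch in enumerate(loader, start=1):
        optimizer.zero_grad()
        if step % 2 == 1:   # odd iteration -> optimize L_SAL
            loss = compute_sal_loss(model, batch)
        else:               # even iteration -> optimize L_RLE
            loss = compute_rle_loss(model, batch)
        loss.backward()
        optimizer.step()
```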
“…Therefore, how to effectively realize multi-modal representation learning and cross-modal semantic relation reasoning over the rich underlying semantic structures of visual information and dialogue context is one of the key challenges. Researchers propose modeling images or videos and dialogue as graph structures [10, 34, 203] and conducting cross-attention-based reasoning [17, 118, 139] to perform fine-grained cross-modal relation reasoning for reasonable response generation; see details in Section 3.3.…”
Section: Research Challenges in VAD (mentioning)
confidence: 99%
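The survey excerpt refers to cross-attention-based reasoning between dialogue text and visual features. A minimal single-head sketch of one such cross-modal attention step, with dialogue tokens attending over video frame features; the shapes, names, and single-head form are illustrative assumptions, not the cited systems' actual architectures:

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(text_feats, video_feats, d_k=64):
    """Dialogue tokens (queries) attend over video frames (keys/values).
    text_feats:  (batch, n_tokens, d_k)
    video_feats: (batch, n_frames, d_k)
    Returns video-grounded text features of shape (batch, n_tokens, d_k)."""
    scores = torch.matmul(text_feats, video_feats.transpose(1, 2)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)        # attention over video frames
    return torch.matmul(weights, video_feats)  # weighted sum of frame features
```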
“…where α is a hyperparameter. Since the objective function is a minimax problem, we alternately train and update the parameters θ and φ in every epoch.

AVSD@DSTC7 results (B1–B4 = BLEU-1–4, M = METEOR, R = ROUGE-L, C = CIDEr):

Method                         B1     B2     B3     B4     M      R      C
Baseline (Hori et al, 2019a)   0.621  0.480  0.379  0.305  0.217  0.481  0.733
HMA (Le et al, 2019a)          0.633  0.490  0.386  0.310  0.242  0.515  0.856
RMFF (Yeh et al, 2019)         0.636  0.510  0.417  0.345  0.224  0.505  0.877
EE-DMN (Lin et al, 2019)       0.641  0.493  0.388  0.310  0.241  0.527  0.912
JMAN (Chu et al, 2020)         0…”
Section: Text Hallucination Regularization (mentioning)
confidence: 99%
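The excerpt above trains a minimax objective by alternating updates of the two parameter sets θ and φ. A hedged sketch of that alternation, assuming a hypothetical combined objective `loss_fn` weighted by α; the actual objective and update schedule are defined in the citing paper:

```python
import torch

def train_minimax(model_theta, model_phi, loader, loss_fn, alpha=0.5, epochs=10):
    """Alternate per epoch: minimize the objective over theta, then maximize
    it over phi (gradient ascent, implemented as descent on the negated loss).
    `loss_fn(model_theta, model_phi, batch, alpha)` is a hypothetical stand-in."""
    opt_theta = torch.optim.Adam(model_theta.parameters(), lr=1e-4)
    opt_phi = torch.optim.Adam(model_phi.parameters(), lr=1e-4)
    for epoch in range(epochs):
        for batch in loader:
            opt_theta.zero_grad()
            opt_phi.zero_grad()
            if epoch % 2 == 0:   # even epoch: update theta (minimize)
                loss = loss_fn(model_theta, model_phi, batch, alpha)
                loss.backward()
                opt_theta.step()
            else:                # odd epoch: update phi (maximize)
                loss = -loss_fn(model_theta, model_phi, batch, alpha)
                loss.backward()
                opt_phi.step()
```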