2021
DOI: 10.1109/taslp.2021.3065852

End-to-End Recurrent Cross-Modality Attention for Video Dialogue

Cited by 3 publications (4 citation statements)
References: 60 publications
“…To boost performance, transformer-based VGD systems (Li et al, 2021b) are utilized on top of large-scale pre-trained language models (Radford et al, 2019; Raffel et al, 2020). Another immense challenge is keeping track of the extended dialogue context and the video, for which memory networks (Lin et al, 2019; Xie and Iacobacci, 2020) and multi-step attention (Chu et al, 2020) were introduced to efficiently store the video and the long episodic dialogue. Graph representations (Kim et al, 2021; Pham et al, 2022) were also popular solutions for holding semantic commonalities between the dialogue and the video.…”
Section: Video-Grounded Dialogues (mentioning)
confidence: 99%
“…See more scheduling functions in Appendix B.2.

AVSD@DSTC7 results (B1–B4 = BLEU-1–4, M = METEOR, R = ROUGE-L, C = CIDEr):

Method                        B1     B2     B3     B4     M      R      C
EE-DMN (Lin et al, 2019)      0.641  0.493  0.388  0.310  0.241  0.527  0.912
JMAN (Chu et al, 2020)        0.667  0.521  0.413  0.334  0.239  0.533  0.941
CMU (Sanabria et al, 2019)    0.718  0.584  0.478  0.394  0.267  0.563  1.094
COST (Pham et al, 2022)       0.723  0.589  0.483  0.400  0.266  0.561  1.085
MSTN                          –      –      –      0.377  0.275  0.566  1.115
JSTL (Hori et al, 2019b)      0…

…video to be freely utilized for their purposes. Therefore, L_SAL is optimized when the training iteration number is odd, and L_RLE when it is even:…”
Section: Optimization and Inference (mentioning)
confidence: 99%
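The excerpt above describes an alternating schedule in which one objective is optimized on odd training iterations and the other on even ones. A minimal PyTorch-style sketch of that parity-based alternation, assuming hypothetical callables `compute_sal_loss` and `compute_rle_loss` standing in for the citing paper's L_SAL and L_RLE terms:

```python
import torch

def train_alternating(model, optimizer, loader, compute_sal_loss, compute_rle_loss):
    """Alternate two objectives by training-iteration parity:
    L_SAL on odd iterations, L_RLE on even ones.
    `compute_sal_loss` / `compute_rle_loss` are hypothetical stand-ins
    mapping (model, batch) to a scalar loss tensor."""
    for step, batch in enumerate(loader, start=1):
        optimizer.zero_grad()
        if step % 2 == 1:   # odd iteration -> optimize L_SAL
            loss = compute_sal_loss(model, batch)
        else:               # even iteration -> optimize L_RLE
            loss = compute_rle_loss(model, batch)
        loss.backward()
        optimizer.step()
```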
“…Therefore, how to effectively realize multi-modal representation learning and cross-modal semantic relation reasoning over the rich underlying semantic structures of visual information and dialogue context is one of the key challenges. Researchers propose modeling images or videos and dialogue as graph structures [10, 34, 203] and conducting cross-attention-based reasoning [17, 118, 139] to perform fine-grained cross-modal relation reasoning for reasonable response generation; see details in Section 3.3.…”
Section: Research Challenges in VAD (mentioning)
confidence: 99%
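The survey excerpt refers to cross-attention-based reasoning between dialogue text and visual features. A minimal single-head sketch of one such cross-modal attention step, with dialogue tokens attending over video frame features; the shapes, names, and single-head form are illustrative assumptions, not the cited systems' actual architectures:

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(text_feats, video_feats, d_k=64):
    """Dialogue tokens (queries) attend over video frames (keys/values).
    text_feats:  (batch, n_tokens, d_k)
    video_feats: (batch, n_frames, d_k)
    Returns video-grounded text features of shape (batch, n_tokens, d_k)."""
    scores = torch.matmul(text_feats, video_feats.transpose(1, 2)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)        # attention over video frames
    return torch.matmul(weights, video_feats)  # weighted sum of frame features
```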
“…where α is a hyperparameter. Since the objective function is a minimax problem, we alternately train and update the parameters θ and φ in every epoch.

AVSD@DSTC7 results (B1–B4 = BLEU-1–4, M = METEOR, R = ROUGE-L, C = CIDEr):

Method                         B1     B2     B3     B4     M      R      C
Baseline (Hori et al, 2019a)   0.621  0.480  0.379  0.305  0.217  0.481  0.733
HMA (Le et al, 2019a)          0.633  0.490  0.386  0.310  0.242  0.515  0.856
RMFF (Yeh et al, 2019)         0.636  0.510  0.417  0.345  0.224  0.505  0.877
EE-DMN (Lin et al, 2019)       0.641  0.493  0.388  0.310  0.241  0.527  0.912
JMAN (Chu et al, 2020)         0…”
Section: Text Hallucination Regularization (mentioning)
confidence: 99%
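The excerpt above trains a minimax objective by alternating updates of the two parameter sets θ and φ. A hedged sketch of that alternation, assuming a hypothetical combined objective `loss_fn` weighted by α; the actual objective and update schedule are defined in the citing paper:

```python
import torch

def train_minimax(model_theta, model_phi, loader, loss_fn, alpha=0.5, epochs=10):
    """Alternate per epoch: minimize the objective over theta, then maximize
    it over phi (gradient ascent, implemented as descent on the negated loss).
    `loss_fn(model_theta, model_phi, batch, alpha)` is a hypothetical stand-in."""
    opt_theta = torch.optim.Adam(model_theta.parameters(), lr=1e-4)
    opt_phi = torch.optim.Adam(model_phi.parameters(), lr=1e-4)
    for epoch in range(epochs):
        for batch in loader:
            opt_theta.zero_grad()
            opt_phi.zero_grad()
            if epoch % 2 == 0:   # even epoch: update theta (minimize)
                loss = loss_fn(model_theta, model_phi, batch, alpha)
                loss.backward()
                opt_theta.step()
            else:                # odd epoch: update phi (maximize)
                loss = -loss_fn(model_theta, model_phi, batch, alpha)
                loss.backward()
                opt_phi.step()
```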