Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.269

VD-BERT: A Unified Vision and Dialog Transformer with BERT

Abstract: Visual dialog is a challenging vision-language task, where a dialog agent needs to answer a series of questions through reasoning on the image content and dialog history. Prior work has mostly focused on various attention mechanisms to model such intricate interactions. By contrast, in this work, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language models for Visual Dialog tasks. The model is unified in that (1) it captures all t…
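
To make the "unified" design concrete, below is a minimal sketch (not the authors' released implementation) of a single BERT encoder that consumes detector-extracted image region features together with the tokenized caption, dialog history, and question as one sequence. The class name VisionDialogEncoder, the region_dim default of 2048, and the randomly initialized BertConfig are illustrative assumptions for this sketch; in practice the encoder would be initialized from pretrained BERT weights, as the abstract describes.

    # Minimal sketch (assumptions noted above), not the authors' released code:
    # one BERT-style Transformer encodes image regions and dialog text jointly.
    import torch
    import torch.nn as nn
    from transformers import BertConfig, BertModel

    class VisionDialogEncoder(nn.Module):
        def __init__(self, region_dim=2048, config=None):
            super().__init__()
            # Random-init config keeps the sketch self-contained; a real setup
            # would load pretrained BERT weights instead.
            self.config = config or BertConfig()
            self.bert = BertModel(self.config)
            # Project detector region features into BERT's hidden space.
            self.region_proj = nn.Linear(region_dim, self.config.hidden_size)

        def forward(self, region_feats, input_ids, attention_mask):
            # region_feats:   (batch, num_regions, region_dim) from an object detector
            # input_ids:      (batch, seq_len) caption + dialog history + current question
            # attention_mask: (batch, seq_len) 1 for real tokens, 0 for padding
            vis_embeds = self.region_proj(region_feats)
            txt_embeds = self.bert.embeddings.word_embeddings(input_ids)
            inputs_embeds = torch.cat([vis_embeds, txt_embeds], dim=1)
            vis_mask = torch.ones(region_feats.shape[:2],
                                  dtype=attention_mask.dtype,
                                  device=attention_mask.device)
            full_mask = torch.cat([vis_mask, attention_mask], dim=1)
            # Self-attention now spans both modalities in a single pass.
            out = self.bert(inputs_embeds=inputs_embeds, attention_mask=full_mask)
            return out.last_hidden_state

    # Toy usage with random inputs:
    enc = VisionDialogEncoder()
    regions = torch.randn(2, 36, 2048)
    ids = torch.randint(0, enc.config.vocab_size, (2, 40))
    mask = torch.ones(2, 40, dtype=torch.long)
    hidden = enc(regions, ids, mask)  # shape: (2, 36 + 40, 768)

The point this sketch mirrors is that cross-modal interaction comes entirely from the shared self-attention layers of a single Transformer, rather than from task-specific cross-attention modules as in earlier visual dialog models.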

Cited by 56 publications (55 citation statements)
References 35 publications
“…While multi-head attention has been widely exploited in many vision-language (VL) tasks, such as image captioning (Zhou et al., 2020), visual question answering (Tan and Bansal, 2019), and visual dialog (Kang et al., 2019; Wang et al., 2020), its potential benefit for modeling flexible cross-media posts has previously been ignored. Due to the informal style of social media, cross-media keyphrase prediction poses unique difficulties in two main respects: first, its text-image relationship is rather complicated (Vempala and Preotiuc-Pietro, 2019), whereas in conventional VL tasks the two modalities share most of their semantics; second, social media images usually exhibit a more diverse distribution and a much higher probability of containing OCR tokens (§4), posing a hurdle for effective processing.…”
Section: Related Work
confidence: 99%
“…Our method improves substantially on LTMI (Nguyen et al., 2020). As shown in Table 3, our approach outperforms VD-BERT (Wang et al., 2020)‡, which is trained from scratch without extra datasets. All the comparisons show that our approach is effective owing to its explicit relation modeling.…”
Section: Generative Results
confidence: 90%
“…(Li et al., 2019) and visual dialog (Das et al., 2017; Kottur et al., 2018; Agarwal et al., 2020; Wang et al., 2020; Qi et al., 2020). Relations in these tasks are significant for reasoning over and understanding the textual and visual information.…”
Section: Introduction
confidence: 99%
“…(3) Visual co-reference resolution models: CorefNMN (Kottur et al., 2018), RvA. (4) The pretraining model: VD-BERT (Wang et al., 2020).…”
Section: Results
confidence: 99%
“…Visual dialogue (Agarwal et al., 2020; Wang et al., 2020; Qi et al., 2020; Murahari et al., 2020) requires agents to give a response based on an understanding of both visual and textual content. One of the key challenges in visual dialogue is how to resolve multimodal co-reference (Das et al., 2017; Kottur et al., 2018).…”
Section: Introduction
confidence: 99%