VD-BERT: A Unified Vision and Dialog Transformer with BERT

Wang, Yue; Joty, Shafiq; Lyu, Michael R.; King, Irwin; Xiong, Caiming; Hoi, Steven C. H.

doi:10.18653/v1/2020.emnlp-main.269

Cited by 56 publications

(55 citation statements)

References 35 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…While multi-head attention has been widely exploited in many vision-language (VL) tasks, such as image captioning (Zhou et al, 2020), visual question answering (Tan and Bansal, 2019;, and visual dialog (Kang et al, 2019;Wang et al, 2020), its potential benefit to model flexible cross-media posts has been previously ignored. Due to the informal style in social media, cross-media keyphrase prediction brings unique difficulties mainly in two aspects: first, its textimage relationship is rather complicated (Vempala and Preotiuc-Pietro, 2019) while in conventional VL tasks the two modalities have most semantics shared; second, social media images usually exhibit a more diverse distribution and a much higher probability of containing OCR tokens ( §4), thereby posing a hurdle for effectively processing.…”

Section: Related Workmentioning

confidence: 99%

Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings

Wang

Lyu

et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self Cite

View full text Add to dashboard Cite

Social media produces large amounts of contents every day. To help users quickly capture what they need, keyphrase prediction is receiving a growing attention. Nevertheless, most prior efforts focus on text modeling, largely ignoring the rich features embedded in the matching images. In this work, we explore the joint effects of texts and images in predicting the keyphrases for a multimedia post. To better align social media style texts and images, we propose: (1) a novel Multi-Modality Multi-Head Attention (M 3 H-Att) to capture the intricate cross-media interactions; (2) image wordings, in forms of optical characters and image attributes, to bridge the two modalities. Moreover, we design a novel unified framework to leverage the outputs of keyphrase classification and generation and couple their advantages. Extensive experiments on a large-scale dataset 1 newly collected from Twitter show that our model significantly outperforms the previous state of the art based on traditional co-attentions. Further analyses show that our multi-head attention is able to attend information from various aspects and boost classification or generation in diverse scenarios.

show abstract

Section: Related Workmentioning

confidence: 99%

Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings

Wang

Lyu

et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self Cite

View full text Add to dashboard Cite

show abstract

“…Our method improves a lot based on the LTMI (Nguyen et al, 2020). As shown in Table 3, our approach outperforms VDBERT (Wang et al, 2020) ‡ which trains from scratch without extra datasets. All the comparison show that our approach is valid due to explicit relation modeling.…”

Section: Generative Resultsmentioning

confidence: 90%

“…Yes Q4: is he looking at the woman ? Li et al, 2019; and visual dialog (Das et al, 2017;Kottur et al, 2018;Agarwal et al, 2020;Wang et al, 2020;Qi et al, 2020). Relations in these tasks are significant for reasoning and understanding the textual and visual information.…”

Section: Introductionmentioning

confidence: 99%

GoG: Relation-aware Graph-over-Graph Network for Visual Dialog

Chen¹,

Chen²,

Meng³

et al. 2021

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

View full text Add to dashboard Cite

Visual dialog, which aims to hold a meaningful conversation with humans about a given image, is a challenging task that requires models to reason the complex dependencies among visual content, dialog history, and current questions. Graph neural networks are recently applied to model the implicit relations between objects in an image or dialog. However, they neglect the importance of 1) coreference relations among dialog history and dependency relations between words for the question representation; and 2) the representation of the image based on the fully represented question. Therefore, we propose a novel relation-aware graph-over-graph network (GoG) for visual dialog. Specifically, GoG consists of three sequential graphs: 1) H-Graph, which aims to capture coreference relations among dialog history; 2) History-aware Q-Graph, which aims to fully understand the question through capturing dependency relations between words based on coreference resolution on the dialog history; and 3) Questionaware I-Graph, which aims to capture the relations between objects in an image based on fully question representation. As an additional feature representation module, we add GoG to the existing visual dialogue model. Experimental results show that our model outperforms the strong baseline in both generative and discriminative settings by a significant margin.

show abstract

“…(3) Visual co-reference resolution models: CorefNMN (Kottur et al, 2018), RvA . (4) The pretraining model: VDBERT (Wang et al, 2020).…”

Section: Resultsmentioning

confidence: 99%

“…Visual dialogue (Agarwal et al, 2020;Wang et al, 2020;Qi et al, 2020;Murahari et al, 2020) requires agents to give a response on the basis of understanding both visual and textual content. One of the key challenges in visual dialogue is how to solve multimodal co-reference (Das et al, 2017;Kottur et al, 2018).…”

Section: Introductionmentioning

confidence: 99%

Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Chen¹,

Meng²,

Chen³

et al. 2021

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

View full text Add to dashboard Cite

Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on the implicit exploration of multimodal coreference by implicitly attending to spatial image features or object-level image features but neglect the importance of locating the objects explicitly in the visual content, which is associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and multimodal incremental transformer. Visual grounding aims to explicitly locate related objects in the image guided by textual entities, which helps the model exclude the visual content that does not need attention. On the basis of visual grounding, the multimodal incremental transformer encodes the multi-turn dialogue history combined with visual scene step by step according to the order of the dialogue and then generates a contextually and visually coherent response. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the superiority of the proposed model, which achieves comparable performance.

show abstract

VD-BERT: A Unified Vision and Dialog Transformer with BERT

Cited by 56 publications

References 35 publications

Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings

Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings

GoG: Relation-aware Graph-over-Graph Network for Visual Dialog

Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Contact Info

Product

Resources

About