Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
DOI: 10.18653/v1/2021.findings-acl.38

Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Abstract: Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies explore multimodal coreference implicitly by attending to spatial or object-level image features, but neglect the importance of explicitly locating the objects in the visual content that are associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Trans…
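As a rough illustration of the idea outlined in the truncated abstract (restricting text-to-image attention to objects that are explicitly grounded to textual entities), the following PyTorch sketch applies a binary grounding mask inside a standard attention step. The function name, tensor shapes, and the grounding_mask input are assumptions made here for illustration only, not the paper's actual implementation.

# Hypothetical sketch: attention over object-level image features restricted
# by an explicit entity-object grounding mask (illustration only, not the
# authors' actual model).
import torch
import torch.nn.functional as F

def grounded_attention(text_states, object_feats, grounding_mask):
    # text_states:    (batch, n_tokens, d)   encoded question/history tokens
    # object_feats:   (batch, n_objects, d)  detected object-level image features
    # grounding_mask: (batch, n_tokens, n_objects), 1 where a token's entity is
    #                 grounded to an object, 0 otherwise
    d = text_states.size(-1)
    scores = torch.matmul(text_states, object_feats.transpose(1, 2)) / d ** 0.5
    # Exclude undesired visual content: ungrounded objects receive -inf scores.
    scores = scores.masked_fill(grounding_mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    # Tokens whose entities ground to no object would yield NaN rows; zero them.
    attn = torch.nan_to_num(attn, nan=0.0)
    return torch.matmul(attn, object_feats)  # (batch, n_tokens, d)

In this reading, the mask is what makes the grounding explicit: attention weights over objects that are not linked to any entity in the question or dialogue history are forced to zero rather than merely down-weighted.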

Cited by 14 publications (8 citation statements)
References 29 publications
“…Comparison with state-of-the-art. We compare GST with the state-of-the-art approaches on the validation set of the VisDial v1.0 and v0.9 datasets, consisting of UTC [23], MITVG [19], VD-BERT [22], LTMI [18], KBGN [17], DAM [16], ReDAN [12], DMRM [15], Primary [11], RvA [9], CorefNMN [8], CoAtt [7], HCIAE [5], and MN [1]. We decide to use the validation splits since all previous studies benchmarked the models on those splits.…”
Section: Quantitative Results and Analysis
confidence: 99%
“…Most of the previous approaches in VisDial [5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20] have trained the dialog agents solely on VisDial data via supervised learning. More recent studies [21][22][23] have employed self-supervised pre-trained models such as BERT [24] or ViLBERT [25] and finetuned them on VisDial data.…”
Section: Introduction
confidence: 99%
“…Niu et al [120] Selectively referring dialogue history to refine the visual attention until referencing the answer. Chen et al [11] Establishing mapping of visual object and textual entities to exclude undesired visual content.…”
Section: Visual Reference Resolution
confidence: 99%
“…The above works all implicitly attend to spatial or object-level image features, which will be inevitably distracted by unnecessary visual content. To address this, Chen et al [11] establish specific mapping of objects in the image and textual entities in the input query and dialogue history, to exclude undesired visual content and reduce attention noise. Additionally, the multimodal incremental transformer integrates visual information and dialogue context to generate visually and contextually coherent responses.…”
Section: Unique Training Schemes-based VAD
confidence: 99%
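To make the "incremental" integration described in the statement above concrete, here is a minimal PyTorch sketch in which each dialogue round is encoded while attending to the running multimodal context (grounded object features plus all previously encoded rounds), and the context is then extended with the new round. The class name, layer sizes, and the use of nn.TransformerDecoder are illustrative assumptions based on the quoted description, not the authors' released architecture.

# Hypothetical sketch of incremental multimodal encoding: each round attends
# to the context accumulated so far (an illustrative reading of the
# description above, not the authors' actual architecture).
import torch
import torch.nn as nn

class IncrementalMultimodalEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.round_encoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, visual_feats, rounds):
        # visual_feats: (batch, n_objects, d_model) grounded object features
        # rounds: list of (batch, n_tokens, d_model) embeddings, one per QA round
        context = visual_feats
        for round_emb in rounds:
            # Encode the current round while cross-attending to the full context.
            encoded = self.round_encoder(tgt=round_emb, memory=context)
            # Grow the context incrementally with the newly encoded round.
            context = torch.cat([context, encoded], dim=1)
        return context  # (batch, n_objects + total_round_tokens, d_model)

A response decoder for the current question could then attend to this accumulated context, which is one way to obtain the visually and contextually coherent responses mentioned above.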
“…Recently, with the rise of pre-trained models [2], researchers have begun to explore vision-and-language task [3,4,5] with pre-trained models [1]. Specifically, visual dialog [6,7,8,9], which aims to hold a meaningful conversation with a human about a given image, is a challenging task that requires models have sufficient cross-modal understanding based on both visual and textual context to answer the current question.…”
Section: Introduction
confidence: 99%