Unsupervised and Pseudo-Supervised Vision-Language Alignment in Visual Dialog

Chen, Feilong; Zhang, Duzhen; Chen, Xiuyi; Shi, Jing; Xu, Shuang; Xu, Bo

doi:10.1145/3503161.3547776

Cited by 6 publications

(1 citation statement)

References 39 publications

“…This dual focus has enriched the understanding of image-text dynamics. Furthermore, the alignment-based approach model (Chen et al 2022) has shown promise in explicitly aligning visual concepts with textual semantics via unsupervised and pseudo-supervised vision-language alignment. Another intriguing approach (Chen et al 2021;Guo et al 2020;Zhang et al 2022b;Zheng et al 2019) is the graph-based representation suitable for the composite scenario of dialog history and image, which offers a structured way to understand relationships within an image.…”

Section: Visual Dialog (mentioning)

confidence: 99%

Structure-Aware Multimodal Sequential Learning for Visual Dialog

Kim et al. 2024

AAAI

With the ability to collect vast amounts of image and natural language data from the web, there has been remarkable advancement in large-scale language models (LLMs). This progress has led to the emergence of chatbots and dialogue systems capable of fluent conversations with humans. As the variety of devices enabling interactions between humans and agents expands, and the performance of text-based dialogue systems improves, research on visual dialog has recently been proposed. However, visual dialog requires understanding sequences of image-sentence pairs, making it challenging to gather sufficient training data for large-scale models from the web. In this paper, we propose a new multimodal learning method that leverages existing large-scale models designed for each modality to enable training for visual dialog with small visual dialog datasets. The key ideas of our approach are: 1) storing the history or context during the progression of visual dialog in the form of spatiotemporal graphs, and 2) introducing small modulation blocks between modality-specific models and the graphs to align the semantic spaces. For implementation, we introduce a novel structure-aware cross-attention method, which retrieves relevant image and text knowledge for utterance generation from the pretrained models. In experiments, we achieved new state-of-the-art performance on three visual dialog datasets, including the most challenging one, COMET.
