2019
DOI: 10.48550/arxiv.1902.09368
Preprint

Dual Attention Networks for Visual Reference Resolution in Visual Dialog

Cited by 9 publications (10 citation statements)
References 12 publications
“…Compared to the results of the Visual Dialog Challenge 2019, our models also show strong results. Although ReDAN+ (Gan et al., 2019) and MReaL-BDAI show higher NDCG scores, our consensus dropout fusion model shows more balanced results across metrics while still having a competitive NDCG score compared to DAN (Kang, Lim, and Zhang, 2019): rank 3 on the NDCG metric and a high balance rank based on the metric average.…”
Section: Final Visual Dialog Test Results
confidence: 78%
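
The NDCG metric discussed in this statement scores how well a model's ranking of the 100 answer candidates agrees with dense human relevance annotations. The following is a minimal sketch of the computation, not code from any cited paper; the function name, the toy relevance scores, and the example rankings are all hypothetical illustrations.

```python
import numpy as np

def ndcg(relevances, ranking, k=None):
    """Normalized Discounted Cumulative Gain for one ranked candidate list.

    relevances: ground-truth relevance score per candidate (index-aligned).
    ranking: candidate indices ordered by model score, best first.
    """
    rel = np.asarray(relevances, dtype=float)
    k = k if k is not None else len(ranking)
    # Logarithmic position discount: position i contributes rel / log2(i + 2).
    discounts = 1.0 / np.log2(np.arange(k) + 2)
    dcg = float(np.sum(rel[ranking[:k]] * discounts))
    # Ideal DCG: the same discounting applied to a perfect ordering.
    ideal = float(np.sum(np.sort(rel)[::-1][:k] * discounts))
    return dcg / ideal if ideal > 0 else 0.0

# Hypothetical example: 5 candidates, candidate 2 is the most relevant.
print(ndcg([0.0, 0.5, 1.0, 0.0, 0.5], ranking=[2, 1, 4, 0, 3]))  # 1.0 (perfect order)
print(ndcg([0.0, 0.5, 1.0, 0.0, 0.5], ranking=[0, 3, 2, 1, 4]))  # < 1.0 (misranked)
```

Because NDCG rewards placing all densely annotated relevant answers near the top, a model can lead on NDCG while trailing on rank-of-ground-truth metrics such as MRR, which is why the quoted comparison stresses balance across metrics.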
“…Multimodal models have proven their ability to model interactions between different modalities and to better understand the semantics behind textual utterances [13,19,32,51,56,58,64,65,78], and pretraining on additional data gives further performance boosts for a variety of established vision-and-language tasks, such as visual question answering [3,46,52], visual commonsense reasoning [37,43] and text-to-image generation [49,59,60]. However, these works focus on QA-style visual dialog, rather than the conversation style with which we are more concerned.…”
Section: Jointly Modeling Visual and Textual Information
confidence: 99%
“…Visual Dialog Generation: Most existing works apply attention mechanisms to model the interplay between text and visual contexts (Lu et al., 2017; Kottur et al., 2018; Jiang and Bansal, 2019; Yang et al., 2019; Guo et al., 2019; Niu et al., 2019; Kang et al., 2019; Park et al., 2020; Jiang et al., 2020b). Other techniques such as reinforcement learning (Das et al., 2017b; Wu et al., 2018), variational auto-encoders (Massiceti et al., 2018) and graph networks (Jiang et al., 2020a) have also been applied to the visual dialog task.…”
Section: Dialog Generation
confidence: 99%
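
The attention mechanisms this statement surveys largely share one core operation: using a text representation to weight a set of visual region features. Below is a minimal sketch of such question-guided visual attention; the function name, feature dimensions, and region count are illustrative assumptions, not the architecture of DAN or any specific cited model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def question_guided_attention(question_vec, region_feats):
    """Attend over image region features using an encoded question.

    question_vec: (d,) encoded question vector.
    region_feats: (num_regions, d) visual features, e.g. from an object detector.
    Returns an attended visual summary of shape (d,).
    """
    d = question_vec.shape[0]
    # Scaled dot-product score between the question and each region.
    scores = region_feats @ question_vec / np.sqrt(d)
    weights = softmax(scores)        # one weight per region, summing to 1
    return weights @ region_feats    # weighted sum of region features

# Illustrative usage with random features (36 regions is a common detector setting).
rng = np.random.default_rng(0)
q = rng.standard_normal(512)
v = rng.standard_normal((36, 512))
attended = question_guided_attention(q, v)
print(attended.shape)  # (512,)
```

The alternatives the quote lists differ in what replaces or augments this step: reinforcement learning changes the training signal, variational auto-encoders change the generation objective, and graph networks replace the flat region set with explicit relational structure.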