“…We compare the results of our model with those of the following previously published models on the VisDial v1.0 dataset: LF (Das et al., 2017), HRE (Das et al., 2017), MN (Das et al., 2017), CorefNMN (Kottur et al., 2018), FGA (Schwartz et al., 2019), RvA (Niu et al., 2019), HACAN (Yang et al., 2019), Synergistic (Guo et al., 2019), DAN (Kang et al., 2019), DualVD (Jiang et al., 2020), and CAG (Guo et al., 2020). To ensure a fair and transparent comparison, we do not compare our model with models that were pretrained on other vision-language datasets before being finetuned on VisDial v1.0, especially since the vision-language datasets used for pretraining overlap with the VisDial v1.0 test set.…”