2020
DOI: 10.1609/aaai.v34i07.6769

DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue

Abstract: Unlike the Visual Question Answering task, which requires answering only a single question about an image, Visual Dialogue involves multiple questions that cover a broad range of visual content and may relate to any object, relationship, or semantic concept in the image. The key challenge in the Visual Dialogue task is therefore to learn a more comprehensive, semantically rich image representation that can attend adaptively to the image for different questions. In this research, we propose a novel model to depict an image from …
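The abstract's central idea, question-adaptive attention over image content, can be illustrated with a short sketch. This is not the paper's actual DualVD architecture; the class name, dimensions, and the simple dot-product scoring below are illustrative assumptions.

```python
# Illustrative sketch only: question-conditioned attention over image region
# features, the general mechanism the abstract refers to. All names and
# dimensions are assumptions, not the DualVD implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionAdaptiveAttention(nn.Module):
    def __init__(self, region_dim=2048, question_dim=512, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, hidden_dim)    # project region features
        self.proj_q = nn.Linear(question_dim, hidden_dim)  # project question encoding

    def forward(self, regions, question):
        # regions:  (batch, num_regions, region_dim)  e.g. detected object features
        # question: (batch, question_dim)             e.g. an RNN/Transformer encoding
        v = self.proj_v(regions)                        # (batch, R, H)
        q = self.proj_q(question).unsqueeze(1)          # (batch, 1, H)
        scores = (v * q).sum(dim=-1)                    # (batch, R) relevance per region
        weights = F.softmax(scores, dim=-1)             # attention adapts to the question
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)  # (batch, region_dim)
        return attended, weights

# Usage sketch: different questions produce different attention weights
# over the same set of image regions.
att = QuestionAdaptiveAttention()
regions = torch.randn(2, 36, 2048)
question = torch.randn(2, 512)
image_repr, weights = att(regions, question)
```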

Cited by 63 publications (37 citation statements)
References 23 publications (35 reference statements)
“…We compare the results of our model with the results of the following previously published models obtained on the VisDial v1.0 dataset: LF (Das et al, 2017), HRE (Das et al, 2017), MN (Das et al, 2017), CorefNMN (Kottur et al, 2018), FGA (Schwartz et al, 2019), RVA (Niu et al, 2019), HA-CAN (Yang et al, 2019), Synergistic (Guo et al, 2019), DAN (Kang et al, 2019), DualVD (Jiang et al, 2020), and CAG (Guo et al, 2020). To make a fair and transparent comparison, we do not compare our models with models that were pretrained on other vision-language datasets before being finetuned on the VisDial v1.0 dataset, especially since the vision-language datasets used in the pretraining overlap with the test set of VisDial v1.0.…”
Section: Quantitative Results (mentioning)
Confidence: 95%
“…FGA (Schwartz et al, 2019) realizes a factor graph attention mechanism, which constructs a graph over all the multi-modal features and estimates their interactions. DualVD (Jiang et al, 2020b) constructs a scene graph to represent the image, embedding both the relationships provided by (Zhang et al, 2019b) and the original object detection features (Anderson et al, 2018). CAG (Guo et al, 2020) focuses on an iterative question-conditioned context-aware graph, including both fine-grained visual-object and textual-history semantics.…”
Section: Visual Dialog (mentioning)
Confidence: 99%
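The statement above describes graph-based image encodings that combine object-detection features with relation embeddings. The following is a minimal, hedged sketch of that general idea; the aggregation scheme, class name, and dimensions are assumptions for illustration, not the published DualVD or CAG models.

```python
# Hedged sketch: combine object-detection features with pairwise relation
# embeddings into per-node representations, roughly in the spirit of the
# graph-based image encodings discussed above. All details are assumptions.
import torch
import torch.nn as nn

class RelationAwareNodeEncoder(nn.Module):
    def __init__(self, obj_dim=2048, rel_dim=300, out_dim=512):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, out_dim)
        self.rel_proj = nn.Linear(obj_dim + rel_dim, out_dim)

    def forward(self, obj_feats, rel_embeds):
        # obj_feats:  (N, obj_dim)     detection features, one row per object
        # rel_embeds: (N, N, rel_dim)  embedding of the relation from object i to j
        n = obj_feats.size(0)
        # message from neighbour j to node i: [object j feature ; relation i->j]
        neighbours = obj_feats.unsqueeze(0).expand(n, n, -1)          # (N, N, obj_dim)
        messages = self.rel_proj(torch.cat([neighbours, rel_embeds], dim=-1))
        aggregated = messages.mean(dim=1)                             # simple mean pooling
        return torch.relu(self.obj_proj(obj_feats) + aggregated)      # (N, out_dim)

# Usage sketch with 36 detected objects and 300-d relation embeddings.
enc = RelationAwareNodeEncoder()
nodes = enc(torch.randn(36, 2048), torch.randn(36, 36, 300))
```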
“…In this part, we compare the proposed method with state-of-the-art methods on VisDial v1.0 and v0.9. VisDial v1.0: first, we compare the performance of our approach with other state-of-the-art methods on the VisDial v1.0 test set, including LF [3], HRE [3], MN [3], CorefMN [16], RvA [21], DL-61 [10], DVAN [7], DAN [14], VGNN [32], FGA [25], DualVD [13] and KBGN [12]. We do not compare our method with the BERT-based models because they are pre-trained on large-scale multimodal data, whereas we only use the VisDial training set to train the model.…”
Section: Comparisons with the State-of-the-Arts (mentioning)
Confidence: 99%