“…We compare the results of our model with those of the following previously published models on the VisDial v1.0 dataset: LF (Das et al., 2017), HRE (Das et al., 2017), MN (Das et al., 2017), CorefNMN (Kottur et al., 2018), FGA (Schwartz et al., 2019), RvA (Niu et al., 2019), HACAN (Yang et al., 2019), Synergistic (Guo et al., 2019), DAN (Kang et al., 2019), DualVD (Jiang et al., 2020), and CAG (Guo et al., 2020). To ensure a fair and transparent comparison, we do not compare our model with models that were pretrained on other vision-language datasets before being finetuned on VisDial v1.0, especially since the vision-language datasets used for pretraining overlap with the VisDial v1.0 test set.…”