Driven by VQA, several datasets have been proposed to minimize the bias observed in natural images (Goyal et al., 2017; Ray et al., 2019); to force models to "reason" over a joint visual and linguistic input (Suhr et al., 2019); to deal with objects' attributes and relations (Krishna et al., 2017); and to encompass more diverse (Zhu et al., 2016) and goal-oriented questions and answers (Gurari et al., 2018). At the same time, some work has proposed higher-level evaluations of VQA models and shown their limitations (Hodosh and Hockenmaier, 2016; Shekhar et al., 2017); similarly, recent attention has been paid to understanding what makes a question "difficult" for a model (Bhattacharya et al., 2019; Terao et al., 2020). Despite impressive progress, current approaches to VQA do not tackle one crucial limitation of the task: the answer to a question is given by the alignment of language and vision rather than by their complementary integration.