| Name | Task | Size | Construction |
|---|---|---|---|
| WinoX [39] | …French, German, Russian [83] answering | 150,000 questions | |
| COFAR [47] | Find an image matching a query | 25,300 images, 40,800 queries | Expert construction |
| CoSim [74] | Counterfactual reasoning about images | 3500 instances | Crowd sourcing |
| CRIC [44] | Compositional reasoning | 96,000 images, 494,000 questions | Synthesized |
| e-SNLI-VE [71] | Visual-textual entailment | 430,000 | Synthesized from SNLI-VE |
| FVQA [138] | Visual question answering | 2190 images | Synthesized |
| GD-VCR [149] | Visual question answering | 328 images, 886 Q/A pairs | Expert construction |
| Half&Half [123] | Reasoning with text and incomplete images | 126,000 examples | Synthesized |
| HumanCog [151] | Who in image is being described? | 67,000 images, 138,000 descriptions | Extracted from VCR + crowd sourcing |
| HVQR [21] | Visual question answering | 33,000 images, 157,000 Q/A pairs | Synthesized |
| IconQA [94] | Visual question answering | 107,400 instances | Crowd sourcing |
| KB-VQA [137] | Visual question answering | 2190 images | Synthesized |
| Naive action-effect prediction [45] | Match image to effect of action | 1400 text effects, 4163 images | Crowd sourcing |
| PTR [61] | Visual question answering | 80,000 images, 800,000 questions | Synthesized (both images and Q/A pairs) |
| Sherlock [60] | Inferences from images | 103,000 images, 363,000 inferences | Crowd sourcing |
| VCR [155] | Visual question answering | 290,000 questions | Crowd sourcing |
| Visual Genome [78] | Visual question answering | 108,000 images | Crowd sourcing |
| WinoGAViL [13] | Match image to text | 4482 examples | Gamification |

Table 8: Image benchmarks

| Name | Task | Size | Construction |
|---|---|---|---|
| AGENT [121] | Is this surprising?… | | |