“…Indeed, we are not the first to make these assumptions. Visual question answering models have already been used to explore neural networks' capacity to learn meaningful representations of referential words, such as nouns and predicates, when trained on language tasks grounded in the visual world (Jiang et al., 2023; Mao, Gan, Kohli, Tenenbaum, & Wu, 2019; Pillai, Matuszek, & Ferraro, 2021; Wang, Mao, Gershman, & Wu, 2021; Zellers et al., 2021). As for function words, Hill, Hermann, Blunsom, and Clark (2018) briefly consider how visually grounded models learn negation, and Kuhnle and Copestake (2019) studied how such models interpret the quantifier “most.”…”