“…Visual concept learning. Learning visual concepts from language and other forms of supervision provides useful representations for various downstream tasks, such as image captioning (Yin and Ordonez, 2017; Wang et al., 2018), visual question answering (Yi et al., 2018; Huang et al., 2019), shape differentiation (Achlioptas et al., 2019), image classification (Mu et al., 2020), and scene manipulation (Prabhudesai et al., 2020). Previous work has focused on various types of representations (Ren et al., 2016; Wu et al., 2017), training algorithms (Faghri et al., 2018; Morgado et al., 2020), and supervision (Johnson et al., 2016; Yang et al., 2018).…”