“…• Non-textual Modality: For images, new datasets are curated for a variety of tasks including caption relevance (Suhr et al, 2019), multimodal MT (Zhou et al, 2018c), soccer commentaries (Koncel-Kedziorski et al, 2014 semantic role labeling (Silberer and Pinkal, 2018), instruction following (Han and Schlangen, 2017), navigation (Andreas and Klein, 2014), understanding physical causality of actions , understanding topological spatial expressions (Kelleher et al, 2006), spoken image captioning , entail-ment (Vu et al, 2018), image search (Kiros et al, 2018), scene generation (Chang et al, 2015), etc., Coming to videos, datasets have become popular for several tasks like identifying action segments (Regneri et al, 2013), sematic parsing (Ross et al, 2018), instruction following from visual demonstration , spatio-temporal question answering (Lei et al, 2020), etc.,…”