Proceedings of the 12th International Conference on Natural Language Generation 2019
DOI: 10.18653/v1/w19-8621
Tell Me More: A Dataset of Visual Scene Description Sequences

Abstract: We present a dataset consisting of what we call image description sequences. These multisentence descriptions of the contents of an image were collected in a pseudo-interactive setting, where the describer was told to describe the given image to a listener who needs to identify the image within a set of images, and who successively asks for more information. As we show, this setup produced nicely structured data that, we think, will be useful for learning models capable of planning and realising such descripti…

Cited by 20 publications (22 citation statements)
References 18 publications (15 reference statements)
“…Visual Dialogues have been the aim of early work on natural language understanding (NLU) (Winograd, 1972) and are now studied by a very active community at the interplay between computer vision and computational linguistics (e.g. Baldridge et al (2018); Ilinykh et al (2019); Haber et al (2019)). Recently, important progress has been made on visual dialogue systems thanks to the release of datasets like Vis-Dial (Das et al, 2017) and GuessWhat?!…”
Section: Introduction
confidence: 99%
“…Visual Dialogues have a long tradition (e.g., Anderson et al, 1991). They can be chit-chat (e.g., Das et al, 2017) or task-oriented (e.g., de Vries et al, 2017; Haber et al, 2019; Ilinykh et al, 2019a, b). Task-oriented dialogues are easier to evaluate since their performance can be judged in terms of their task-success, hence we focus on this type of dialogues, which can be further divided as follows: the two agents can have access to the same visual information (de Vries et al, 2017), share only part of it (Haber et al, 2019; Ilinykh et al, 2019a), or only one agent has access to the image (Chattopadhyay et al, 2017).…”
Section: Introduction
confidence: 99%
“…Instead, the reference is guided by visual attention. We present a linguistic perspective on these challenges by analysing a pilot annotation of two situated dialogue corpora: the Cups corpus (Dobnik et al, 2020) and the Tell-me-more corpus (Ilinykh et al, 2019), shown below in Figure 1 and example (1) respectively. Starting from the annotation scheme for several textual coreference datasets (Artstein and Poesio, 2006; Pradhan et al, 2007; Uryupina et al, 2019), this exercise proved useful to pinpoint in what ways the purely textual document scenario is different from the domain of embodied interaction.…”
Section: Introduction
confidence: 99%