Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/P19-1644

A Corpus for Reasoning about Natural Language Grounded in Photographs

Abstract: We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the…
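
To make the task description concrete, the sketch below shows one way an example from such a corpus could be represented and scored in Python. The field names, data values, and the trivial constant-prediction baseline are assumptions for illustration only, not the corpus's released file format.

```python
# Minimal sketch of the task interface described in the abstract: each example
# pairs two web photographs with an English sentence, and a system must decide
# whether the sentence is true of the pair. Field names are illustrative
# assumptions, not the released data format.
from dataclasses import dataclass


@dataclass
class PhotoPairExample:
    identifier: str       # unique example id (assumed field)
    left_image_url: str   # first photograph in the pair
    right_image_url: str  # second photograph in the pair
    sentence: str         # English caption to verify
    label: bool           # True if the caption holds for the photo pair


def evaluate(predict, examples):
    """Accuracy of a binary predictor over caption / photo-pair examples."""
    correct = sum(
        predict(ex.left_image_url, ex.right_image_url, ex.sentence) == ex.label
        for ex in examples
    )
    return correct / len(examples) if examples else 0.0


# Usage: a constant "always true" baseline on a single made-up example.
examples = [
    PhotoPairExample("ex-0", "http://example.com/a.jpg", "http://example.com/b.jpg",
                     "There are two dogs in total across both images.", True),
]
print(evaluate(lambda left, right, sentence: True, examples))
```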

Cited by 289 publications (265 citation statements)
References 42 publications
“…We test 3 models that have proved effective in visual reasoning tasks (Johnson et al., 2017a; Suhr et al., 2018; Yi et al., 2018). All models are multi-modal, i.e., they use both a visual representation of the scene and a linguistic representation of the sentence.…”
Section: Models (mentioning, confidence: 99%)
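
The statement above describes the common multi-modal pattern: a visual representation of the scene is combined with a linguistic representation of the sentence before classification. The sketch below illustrates that pattern under stated assumptions; the dimensions, the fusion-by-concatenation choice, and the feature encoders are placeholders, and the cited works use considerably richer architectures.

```python
# Minimal sketch of a multi-modal classifier: pooled image features and a
# sentence embedding are concatenated and passed to a small MLP. All sizes
# and the fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn


class SimpleMultiModalClassifier(nn.Module):
    def __init__(self, image_dim=2048, text_dim=300, hidden_dim=512, num_labels=2):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),  # project fused features
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),            # e.g. true / false
        )

    def forward(self, image_features, sentence_features):
        # image_features: (batch, image_dim), e.g. pooled CNN features
        # sentence_features: (batch, text_dim), e.g. averaged word embeddings
        fused = torch.cat([image_features, sentence_features], dim=-1)
        return self.fuse(fused)


# Usage with random stand-in features.
model = SimpleMultiModalClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 300))
print(logits.shape)  # torch.Size([4, 2])
```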
“…FOIL takes a different approach and requires a system to differentiate invalid image descriptions from valid ones (Shekhar et al., 2017). Natural Language Visual Reasoning (NLVR) requires verifying if image descriptions are true (Suhr et al., 2017, 2018).…”
Section: Tasks in V&L Research (mentioning, confidence: 99%)
“…These in turn are exploited by VQA models, which become heavily reliant upon such statistical biases and tendencies within the answer distribution to largely circumvent the need for true visual scene understanding [2, 11, 15, 8]. This situation is exacerbated by the simplicity of many of the questions, from both linguistic and semantic perspectives, which in practice rarely require much beyond object recognition [33]. Consequently, early benchmarks led to an inflated sense of the state of scene understanding, severely diminishing their credibility [37].…”
Section: Introduction (mentioning, confidence: 99%)