2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00431

CLEVR-Ref+: Diagnosing Visual Reasoning With Referring Expressions

Abstract: Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process. To address these issues and complement similar efforts in visual question answering, we build CLEVR-Ref+, a synthetic diagnostic dataset for referring expression comprehension.…
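To make the evaluation side of these tasks concrete, below is a minimal sketch of the intersection-over-union (IoU) metric commonly used to score referring image segmentation. The function name, mask shapes, and the empty-mask convention are our illustrative choices, not details taken from the paper.

```python
import numpy as np

def segmentation_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union between predicted and ground-truth binary
    segmentation masks (both H x W arrays interpretable as booleans)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # convention (ours): two empty masks count as a perfect match
    return np.logical_and(pred, gt).sum() / union

# Illustrative usage on tiny 4x4 masks.
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True
gt = np.zeros((4, 4), dtype=bool); gt[1:4, 1:4] = True
print(round(segmentation_iou(pred, gt), 3))  # 0.444 (intersection 4, union 9)
```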

Cited by 89 publications (80 citation statements). References 33 publications.

Citation statements, ordered by relevance:

“…For the VQA task, we evaluate on the GQA dataset [17] and the CLEVR dataset [18], which both require resolving relations between objects. For the REF task, we evaluate on the CLEVR-Ref+ dataset [24]. In particular, the CLEVR and CLEVR-Ref+ datasets contain many complicated questions or expressions with higher-order relations, such as the ball on the left of the object behind a blue cylinder.…”
Section: Methods
Mentioning confidence: 99%
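The quoted example of a higher-order relation unpacks into a chain of filter and relate steps over a symbolic scene. The toy executor below is our own illustration of that decomposition; the scene encoding, relation tables, and function names are assumptions, not the dataset's actual functional-program format.

```python
# Toy scene: three objects with attributes; relations[rel][i] holds the ids
# of the objects standing in relation `rel` to object i.
scene = [
    {"id": 0, "shape": "cylinder", "color": "blue"},
    {"id": 1, "shape": "cube", "color": "red"},
    {"id": 2, "shape": "sphere", "color": "green"},
]
relations = {
    "behind": {0: {1}, 1: set(), 2: set()},  # object 1 is behind object 0
    "left": {0: set(), 1: {2}, 2: set()},    # object 2 is left of object 1
}

def filter_objs(objs, **attrs):
    """Keep the objects whose attributes match every keyword constraint."""
    return [o for o in objs if all(o[k] == v for k, v in attrs.items())]

def relate(objs, rel):
    """Return all scene objects standing in relation `rel` to any input object."""
    ids = set().union(*(relations[rel][o["id"]] for o in objs)) if objs else set()
    return [o for o in scene if o["id"] in ids]

# "the ball on the left of the object behind a blue cylinder"
blue_cyl = filter_objs(scene, color="blue", shape="cylinder")      # -> object 0
behind_it = relate(blue_cyl, "behind")                             # -> object 1
referent = filter_objs(relate(behind_it, "left"), shape="sphere")  # -> object 2
print([o["id"] for o in referent])  # [2]
```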
“…In these tasks, we replace the local appearance-based visual representations with the context-aware representations from our LCGN model, and demonstrate that our context-aware scene representations can be used as inputs to perform complex reasoning via simple task-specific approaches, with a consistent improvement over the local appearance features across different tasks and datasets. We obtain state-of-the-art results on the GQA dataset [17] for VQA and the CLEVR-Ref+ dataset [24] for REF.…”
Section: Answer: Yes
Mentioning confidence: 99%
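The contrast drawn here, local appearance features versus context-aware ones, can be sketched as a single attention-style message-passing round over per-object features. The update rule below is our simplification for illustration and not LCGN's actual gated graph update; all names, dimensions, and the random inputs are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 8
feats = rng.normal(size=(N, D))  # local appearance feature per detected object
text = rng.normal(size=(D,))     # pooled embedding of the question/expression

def context_round(x: np.ndarray, txt: np.ndarray) -> np.ndarray:
    """One message-passing round: each object attends to the others, with
    attention conditioned on the text embedding."""
    queries = x * txt                     # language-conditioned queries
    logits = queries @ x.T / np.sqrt(D)   # pairwise compatibility, N x N
    np.fill_diagonal(logits, -np.inf)     # objects send no messages to themselves
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return x + attn @ x                   # residual update -> context-aware

ctx = context_round(context_round(feats, text), text)  # stack several rounds
print(ctx.shape)  # (5, 8): same shape as the input, now relation-aware
```

A downstream task head can then consume these contextualized features wherever it would otherwise have consumed the raw local ones.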
“…We choose CLEVR, inspired by many works that use it to build diagnostic datasets for various vision and language tasks, e.g. visual question answering [26], referring expression comprehension [22,34], text-to-image generation [13] or visual dialog [33]. As Change Captioning is an emerging task, we believe our dataset can complement existing datasets, e.g.…”
Section: CLEVR-Change Dataset
Mentioning confidence: 99%