2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00431

CLEVR-Ref+: Diagnosing Visual Reasoning With Referring Expressions

Abstract: Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there has been evidence that current benchmark datasets suffer from bias, and current state-of-the-art models cannot be easily evaluated on their intermediate reasoning process. To address these issues and complement similar efforts in visual question answering, we build CLEVR-Ref+, a synthetic diagnostic dataset for referring expression comprehension.…
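To make the evaluation side of these tasks concrete, below is a minimal sketch of the intersection-over-union (IoU) metric commonly used to score referring image segmentation. The function name, mask shapes, and the empty-mask convention are our illustrative choices, not details taken from the paper.

```python
import numpy as np

def segmentation_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union between predicted and ground-truth binary
    segmentation masks (both H x W arrays interpretable as booleans)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # convention (ours): two empty masks count as a perfect match
    return np.logical_and(pred, gt).sum() / union

# Illustrative usage on tiny 4x4 masks.
pred = np.zeros((4, 4), dtype=bool); pred[1:3, 1:3] = True
gt = np.zeros((4, 4), dtype=bool); gt[1:4, 1:4] = True
print(round(segmentation_iou(pred, gt), 3))  # 0.444 (intersection 4, union 9)
```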

Cited by 89 publications (80 citation statements). References 33 publications.

Citation statements, ordered by relevance:

“…For the VQA task, we evaluate on the GQA dataset [17] and the CLEVR dataset [18], which both require resolving relations between objects. For the REF task, we evaluate on the CLEVR-Ref+ dataset [24]. In particular, the CLEVR and CLEVR-Ref+ datasets contain many complicated questions or expressions with higher-order relations, such as the ball on the left of the object behind a blue cylinder.…”
Section: Methods
Mentioning confidence: 99%
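The quoted example of a higher-order relation unpacks into a chain of filter and relate steps over a symbolic scene. The toy executor below is our own illustration of that decomposition; the scene encoding, relation tables, and function names are assumptions, not the dataset's actual functional-program format.

```python
# Toy scene: three objects with attributes; relations[rel][i] holds the ids
# of the objects standing in relation `rel` to object i.
scene = [
    {"id": 0, "shape": "cylinder", "color": "blue"},
    {"id": 1, "shape": "cube", "color": "red"},
    {"id": 2, "shape": "sphere", "color": "green"},
]
relations = {
    "behind": {0: {1}, 1: set(), 2: set()},  # object 1 is behind object 0
    "left": {0: set(), 1: {2}, 2: set()},    # object 2 is left of object 1
}

def filter_objs(objs, **attrs):
    """Keep the objects whose attributes match every keyword constraint."""
    return [o for o in objs if all(o[k] == v for k, v in attrs.items())]

def relate(objs, rel):
    """Return all scene objects standing in relation `rel` to any input object."""
    ids = set().union(*(relations[rel][o["id"]] for o in objs)) if objs else set()
    return [o for o in scene if o["id"] in ids]

# "the ball on the left of the object behind a blue cylinder"
blue_cyl = filter_objs(scene, color="blue", shape="cylinder")      # -> object 0
behind_it = relate(blue_cyl, "behind")                             # -> object 1
referent = filter_objs(relate(behind_it, "left"), shape="sphere")  # -> object 2
print([o["id"] for o in referent])  # [2]
```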
“…In these tasks, we replace the local appearance-based visual representations with the context-aware representations from our LCGN model, and demonstrate that our context-aware scene representations can be used as inputs to perform complex reasoning via simple task-specific approaches, with a consistent improvement over the local appearance features across different tasks and datasets. We obtain state-of-the-art results on the GQA dataset [17] for VQA and the CLEVR-Ref+ dataset [24] for REF.…”
Section: Answer: Yes
Mentioning confidence: 99%
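The contrast drawn here, local appearance features versus context-aware ones, can be sketched as a single attention-style message-passing round over per-object features. The update rule below is our simplification for illustration and not LCGN's actual gated graph update; all names, dimensions, and the random inputs are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 8
feats = rng.normal(size=(N, D))  # local appearance feature per detected object
text = rng.normal(size=(D,))     # pooled embedding of the question/expression

def context_round(x: np.ndarray, txt: np.ndarray) -> np.ndarray:
    """One message-passing round: each object attends to the others, with
    attention conditioned on the text embedding."""
    queries = x * txt                     # language-conditioned queries
    logits = queries @ x.T / np.sqrt(D)   # pairwise compatibility, N x N
    np.fill_diagonal(logits, -np.inf)     # objects send no messages to themselves
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return x + attn @ x                   # residual update -> context-aware

ctx = context_round(context_round(feats, text), text)  # stack several rounds
print(ctx.shape)  # (5, 8): same shape as the input, now relation-aware
```

A downstream task head can then consume these contextualized features wherever it would otherwise have consumed the raw local ones.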
“…We choose CLEVR, inspired by many works that use it to build diagnostic datasets for various vision and language tasks, e.g. visual question answering [26], referring expression comprehension [22,34], text-to-image generation [13] or visual dialog [33]. As Change Captioning is an emerging task, we believe our dataset can complement existing datasets, e.g.…”
Section: CLEVR-Change Dataset
Mentioning confidence: 99%