2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00478

A Fast and Accurate One-Stage Approach to Visual Grounding

Abstract: We propose a simple, fast, and accurate one-stage approach to visual grounding, inspired by the following insight. The performances of existing propose-and-rank two-stage methods are capped by the quality of the region candidates they propose in the first stage: if none of the candidates could cover the ground truth region, there is no hope in the second stage to rank the right region to the top. To avoid this caveat, we propose a one-stage model that enables end-to-end joint optimization. The main idea is as s…
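
The insight stated in the abstract can be illustrated with a short sketch: if no first-stage candidate overlaps the ground truth region well enough, no second-stage ranking can succeed, so the fraction of samples with at least one adequate candidate bounds two-stage accuracy from above. The corner box format, the IoU threshold of 0.5, and the function names below are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def has_covering_candidate(proposals: List[Box], gt: Box, thresh: float = 0.5) -> bool:
    """True if at least one proposal covers the ground truth (IoU >= thresh)."""
    return any(iou(p, gt) >= thresh for p in proposals)


def two_stage_upper_bound(samples: List[Tuple[List[Box], Box]], thresh: float = 0.5) -> float:
    """Best accuracy a perfect ranker could reach given the first-stage proposals."""
    hits = sum(has_covering_candidate(props, gt, thresh) for props, gt in samples)
    return hits / len(samples)


# Hypothetical toy data: the second sample has no candidate near the ground truth,
# so even a perfect second stage cannot localize it.
samples = [
    ([(0, 0, 10, 10), (20, 20, 30, 30)], (1, 1, 10, 10)),
    ([(0, 0, 10, 10)], (50, 50, 60, 60)),
]
print(two_stage_upper_bound(samples))
```

On this toy data only one of the two samples has a covering candidate, so a perfect ranker would still be limited to half of the samples.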

citations
Cited by 287 publications
(309 citation statements)
references
References 47 publications
“…The task of referring expression comprehension has attracted increasing attention in recent years, which expects to locate corresponding objects within an image based on input expressions. Previous referring expression comprehension methods [2, 6, 11, 16, 22, 25, 27, 28, 35, 41-43, 45, 47-51] can be mainly divided into two types, including proposal-region-based [2, 6, 11, 22, 25, 27, 28, 41-43, 47-51] and grid-region-based methods [16, 35, 45].…”
Section: Related Work (mentioning)
confidence: 99%
“…Grid-region-based methods [16,35,45] usually fuse the language features with grid region features, and then leverage one-stage object detectors (e.g. YOLOv3 [33]) to directly localize the object corresponding to the input expression.…”
Section: Related Work (mentioning)
confidence: 99%
“…Bajaj et al [3] achieved significant improvement by using Gated Graph Neural Networks to formulate the dependency among phrases and image regions. Yang et al [51] proposed a one-stage approach, fusing text query embeddings into the YOLOv3 object detector while augmenting by using spatial features. Lai et al [29] proposed to use transformers to capture contextual representations for text tokens and image regions.…”
Section: Related Work (mentioning)
confidence: 99%
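
The fusion described in the citation statement above (text query embeddings combined with detector features and augmented with spatial features) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-cell concatenation, the two-value normalized cell-center encoding, and the name `fuse_grid` are all hypothetical.

```python
from typing import List

Grid = List[List[List[float]]]  # indexed as grid[row][col][channel]


def fuse_grid(visual: Grid, text: List[float]) -> Grid:
    """Concatenate, at every grid cell, the visual feature, the text embedding,
    and a simple spatial feature (normalized cell-center x and y; an assumption)."""
    h, w = len(visual), len(visual[0])
    fused: Grid = []
    for i in range(h):
        row = []
        for j in range(w):
            spatial = [(j + 0.5) / w, (i + 0.5) / h]
            # List concatenation copies the text embedding into each cell.
            row.append(visual[i][j] + text + spatial)
        fused.append(row)
    return fused


# Hypothetical 2x2 grid with 3 visual channels and a 2-dimensional text embedding.
visual = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]
text = [1.0, -1.0]
fused = fuse_grid(visual, text)
print(len(fused[0][0]))  # channels per cell after fusion
```

A detection head would then predict a box from each fused cell, so the language query conditions every location directly instead of re-ranking precomputed region proposals.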
“…description sentence, for example "break the eggs", visual grounding aims at localizing the query objects described in the sentence on the given image or video. Recently, great progress has been made on image grounding [4,15,30,31]. On the basis of this, researchers started to explore grounding in the video domain [5,7,14,25,35].…”
mentioning
confidence: 99%