2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00179

TransVG: End-to-End Visual Grounding with Transformers

Abstract: In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. The previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually-designed mechanisms. Such heuristic designs are not only complicated but also make models easily overfit specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressin…
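
To make the abstract's description concrete, below is a minimal sketch (not the authors' code) of a TransVG-style grounding head: visual and text tokens are projected to a shared width, concatenated with a learnable [REG] token, fused by a standard Transformer encoder, and the [REG] output is regressed directly to a normalized box. Module names, dimensions, and the dummy inputs are illustrative assumptions; the real TransVG uses a DETR-based visual branch and a BERT text branch.

```python
# Minimal sketch of a TransVG-style grounding head (assumed, simplified).
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, d_model=256, layers=6, heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)                 # project visual tokens
        self.txt_proj = nn.Linear(txt_dim, d_model)                 # project text tokens
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable [REG] token
        enc_layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.box_mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),                    # normalized (cx, cy, w, h)
        )

    def forward(self, vis_tokens, txt_tokens):
        b = vis_tokens.size(0)
        reg = self.reg_token.expand(b, -1, -1)
        x = torch.cat([reg, self.vis_proj(vis_tokens), self.txt_proj(txt_tokens)], dim=1)
        x = self.fusion(x)                    # joint multi-modal correspondence/reasoning
        return self.box_mlp(x[:, 0])          # regress the box directly from the [REG] token

# Usage with dummy features: 400 visual tokens (a 20x20 map) and 20 text tokens.
if __name__ == "__main__":
    head = GroundingHead()
    box = head(torch.randn(2, 400, 2048), torch.randn(2, 20, 768))
    print(box.shape)  # torch.Size([2, 4])
```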

Cited by 189 publications (161 citation statements); References 77 publications.
“…To make the image size uniform across tasks (see Table 1), we adopt the LOC task's image size of 640 [21] as a middle ground. This is larger but comparable to the image size of REC task [89,88,13]. It is smaller than the size of DET task's images [23] which might limit performance on smaller objects.…”
Section: Task Unification and Multi-task Learning (mentioning)
confidence: 92%
“…Concretely, we utilize the more expressive cross-attention fusion on lower resolution features, and the more efficient product fusion on higher resolution features to combine the best of both worlds. Last but not least, we discover that a standard object detector and detection losses [69] are sufficient and surprisingly effective for REC, LOC, and DET tasks without a need for task-specific design and losses [13,21,51,55,88,89,91]. In short, FindIt is a simple, efficient, and end-to-end trainable model for unified visual grounding and object detection.…”
Section: Introduction (mentioning)
confidence: 89%
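
The excerpt above contrasts two fusion styles: expressive cross-attention fusion on low-resolution feature maps and cheaper elementwise-product fusion on high-resolution maps. The sketch below illustrates that idea under stated assumptions; it is not FindIt's implementation, and the module names, the token-count threshold, and the gating scheme are hypothetical.

```python
# Hedged sketch of mixing cross-attention fusion (low-res) with product fusion (high-res).
import torch
import torch.nn as nn

class MixedFusion(nn.Module):
    def __init__(self, d_model=256, heads=8, max_attn_tokens=1024):
        super().__init__()
        self.max_attn_tokens = max_attn_tokens   # assumed cutoff: attention only below this size
        self.cross_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.txt_gate = nn.Linear(d_model, d_model)

    def fuse_level(self, vis, txt):
        # vis: (B, N, D) flattened feature map; txt: (B, L, D) text tokens
        if vis.size(1) <= self.max_attn_tokens:
            # cross-attention fusion: visual tokens attend to the text (more expressive)
            fused, _ = self.cross_attn(query=vis, key=txt, value=txt)
            return vis + fused
        # product fusion: modulate visual tokens by a pooled text embedding (cheaper)
        gate = torch.sigmoid(self.txt_gate(txt.mean(dim=1, keepdim=True)))
        return vis * gate

    def forward(self, pyramid, txt):
        # pyramid: list of flattened maps, low to high resolution
        return [self.fuse_level(level, txt) for level in pyramid]

if __name__ == "__main__":
    fusion = MixedFusion()
    pyramid = [torch.randn(2, n, 256) for n in (400, 1600, 6400)]
    out = fusion(pyramid, torch.randn(2, 20, 256))
    print([o.shape for o in out])
```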