2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00179

TransVG: End-to-End Visual Grounding with Transformers

Abstract: In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. The previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually-designed mechanisms. Such heuristic designs are not only complicated but also make models easily overfit specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressin…
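
To make the abstract's description concrete, below is a minimal sketch (not the authors' code) of a TransVG-style grounding head: visual and text tokens are projected to a shared width, concatenated with a learnable [REG] token, fused by a standard Transformer encoder, and the [REG] output is regressed directly to a normalized box. Module names, dimensions, and the dummy inputs are illustrative assumptions; the real TransVG uses a DETR-based visual branch and a BERT text branch.

```python
# Minimal sketch of a TransVG-style grounding head (assumed, simplified).
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, d_model=256, layers=6, heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)                 # project visual tokens
        self.txt_proj = nn.Linear(txt_dim, d_model)                 # project text tokens
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable [REG] token
        enc_layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.box_mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),                    # normalized (cx, cy, w, h)
        )

    def forward(self, vis_tokens, txt_tokens):
        b = vis_tokens.size(0)
        reg = self.reg_token.expand(b, -1, -1)
        x = torch.cat([reg, self.vis_proj(vis_tokens), self.txt_proj(txt_tokens)], dim=1)
        x = self.fusion(x)                    # joint multi-modal correspondence/reasoning
        return self.box_mlp(x[:, 0])          # regress the box directly from the [REG] token

# Usage with dummy features: 400 visual tokens (a 20x20 map) and 20 text tokens.
if __name__ == "__main__":
    head = GroundingHead()
    box = head(torch.randn(2, 400, 2048), torch.randn(2, 20, 768))
    print(box.shape)  # torch.Size([2, 4])
```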

Cited by 189 publications (161 citation statements); References 77 publications.
“…To make the image size uniform across tasks (see Table 1), we adopt the LOC task's image size of 640 [21] as a middle ground. This is larger but comparable to the image size of REC task [89,88,13]. It is smaller than the size of DET task's images [23] which might limit performance on smaller objects.…”
Section: Task Unification and Multi-task Learning (mentioning)
confidence: 92%
“…Concretely, we utilize the more expressive cross-attention fusion on lower resolution features, and the more efficient product fusion on higher resolution features to combine the best of both worlds. Last but not least, we discover that a standard object detector and detection losses [69] are sufficient and surprisingly effective for REC, LOC, and DET tasks without a need for task-specific design and losses [13,21,51,55,88,89,91]. In short, FindIt is a simple, efficient, and end-to-end trainable model for unified visual grounding and object detection.…”
Section: Introduction (mentioning)
confidence: 89%
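
The excerpt above contrasts two fusion styles: expressive cross-attention fusion on low-resolution feature maps and cheaper elementwise-product fusion on high-resolution maps. The sketch below illustrates that idea under stated assumptions; it is not FindIt's implementation, and the module names, the token-count threshold, and the gating scheme are hypothetical.

```python
# Hedged sketch of mixing cross-attention fusion (low-res) with product fusion (high-res).
import torch
import torch.nn as nn

class MixedFusion(nn.Module):
    def __init__(self, d_model=256, heads=8, max_attn_tokens=1024):
        super().__init__()
        self.max_attn_tokens = max_attn_tokens   # assumed cutoff: attention only below this size
        self.cross_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.txt_gate = nn.Linear(d_model, d_model)

    def fuse_level(self, vis, txt):
        # vis: (B, N, D) flattened feature map; txt: (B, L, D) text tokens
        if vis.size(1) <= self.max_attn_tokens:
            # cross-attention fusion: visual tokens attend to the text (more expressive)
            fused, _ = self.cross_attn(query=vis, key=txt, value=txt)
            return vis + fused
        # product fusion: modulate visual tokens by a pooled text embedding (cheaper)
        gate = torch.sigmoid(self.txt_gate(txt.mean(dim=1, keepdim=True)))
        return vis * gate

    def forward(self, pyramid, txt):
        # pyramid: list of flattened maps, low to high resolution
        return [self.fuse_level(level, txt) for level in pyramid]

if __name__ == "__main__":
    fusion = MixedFusion()
    pyramid = [torch.randn(2, n, 256) for n in (400, 1600, 6400)]
    out = fusion(pyramid, torch.randn(2, 20, 256))
    print([o.shape for o in out])
```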