2022
DOI: 10.48550/arxiv.2201.03014
Preprint

Glance and Focus Networks for Dynamic Visual Recognition

Cited by 3 publications (2 citation statements)
“…Recently, Transformer [50] has attracted the attention of computer vision community due to its success in the field of natural language processing. A series of Transformer-based methods [13,27,56,51,36,18,12,6,57,60,25,42] have been developed for high-level vision tasks, including image classification [36,13,27,44,49], object detection [34,48,36,4,6], segmentation [55,51,16,2], etc. Although vision Transformer has shown its superiority on modeling long-range dependency [13,43], there are still many works demonstrating that the convolution can help Transformer achieve better visual representation [56,58,61,60,25].…”
Section: Vision Transformer (mentioning)
confidence: 99%
“…Visual grounding (VG) task [13,24,40,65] has achieved great progress in recent years, with the advances in both computer vision [16,20,21,25,26,46,56,57,59] and natural language processing [4,14,41,50,53]. It aims to localize the objects referred by natural language queries, which is essential for various vision-language tasks, e.g., visual question answering [2] and visual commonsense reasoning [67].…”
Section: Introduction (mentioning)
confidence: 99%