2021 | Preprint
DOI: 10.48550/arxiv.2106.02689

RegionViT: Regional-to-Local Attention for Vision Transformers

Abstract: Vision transformer (ViT) has recently shown its strong capability in achieving results comparable to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits its architecture directly from natural language processing, which is often not optimized for vision applications. Motivated by this, in this paper, we propose a new architecture that adopts the pyramid structure and employs a novel regional-to-local attention rather than global self-attention in vision tr…
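
To make the regional-to-local idea concrete, here is a minimal sketch assuming one regional token per window of local tokens, with a plain nn.MultiheadAttention standing in for the paper's exact attention modules; the class and argument names are our own illustration, not the authors' code.

```python
# Minimal sketch of regional-to-local attention (illustrative, not the
# authors' implementation). Assumes one regional token per local window.
import torch
import torch.nn as nn

class RegionalToLocalAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.regional_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, regional, local):
        # regional: (B, R, C) -- one token per region
        # local:    (B, R, W, C) -- W local tokens attached to each region
        B, R, W, C = local.shape

        # 1) Regional self-attention: exchange global information among
        #    all regional tokens (cheap: only R tokens participate).
        regional, _ = self.regional_attn(regional, regional, regional)

        # 2) Local self-attention: each regional token joins its own window
        #    of local tokens, so global context reaches every local token
        #    without full global attention over the whole image.
        tokens = torch.cat([regional.unsqueeze(2), local], dim=2)  # (B, R, 1+W, C)
        tokens = tokens.reshape(B * R, 1 + W, C)
        tokens, _ = self.local_attn(tokens, tokens, tokens)
        tokens = tokens.reshape(B, R, 1 + W, C)
        return tokens[:, :, 0], tokens[:, :, 1:]  # updated regional, local
```

In a full model a block like this would sit inside each pyramid stage, with the token windows re-folded into a feature map between stages.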

Cited by 23 publications (41 citation statements)
References 42 publications (74 reference statements)

“…However, it still lacks the connections between distant patches, contradicting the intention of MSA. In contrast, our method follows a local-to-global paradigm, which has achieved much success in vision tasks [7,9,56]. Our method not only preserves a global receptive field at each block but is also efficient in computation, as shown in the above analysis.…”
Section: Discussion (mentioning)
confidence: 95%
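
The computational argument behind this paradigm can be made concrete with a rough token-pair count (our illustrative numbers, not figures from either cited paper): global self-attention compares all N tokens pairwise, while a regional-to-local scheme pays only for one small regional attention plus one windowed attention per region.

```python
# Back-of-the-envelope attention-pair counts (illustrative numbers,
# not from the cited papers): 56x56 local tokens with 7x7 windows.
N = 56 * 56          # local tokens = 3136
W = 7 * 7            # local tokens per window = 49
R = N // W           # regional tokens, one per window = 64

global_pairs = N ** 2                    # vanilla global self-attention
r2l_pairs = R ** 2 + R * (W + 1) ** 2    # regional attn + per-window local attn

print(global_pairs)              # 9834496
print(r2l_pairs)                 # 4096 + 160000 = 164096
print(global_pairs / r2l_pairs)  # ~60x fewer attention pairs
```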
“…Comparison with Image-based ViTs. Our DualFormer can also be linked to several image-based transformers with a local-global stratified design, including RegionViT [7] and Twins-SVT [9]. The major differences between our approach and RegionViT lie in two aspects.…”
Section: Discussion (mentioning)
confidence: 99%
“…We report the performance on the validation subset, and use the mean average precision (AP) as the metric. We evaluate ELSA-Swin in Mask RCNN / Cascade Mask RCNN [2,33], which is a common practice in [6,70,71,79,87]. Following the common training protocol, we apply multi-scale training, scaling the shorter side of the input from 480 to 800 while keeping the longer side no more than 1333.…”
Section: Object Detection on COCO (mentioning)
confidence: 99%
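
The multi-scale training rule quoted above is easy to sketch. The helper below is hypothetical (not from the ELSA codebase or any detection library): it samples a shorter-side target from [480, 800] and caps the longer side at 1333.

```python
# Hypothetical helper illustrating the multi-scale training rule quoted
# above: shorter side sampled in [480, 800], longer side capped at 1333.
import random

def sample_train_size(h: int, w: int,
                      short_range=(480, 800), max_long=1333) -> tuple[int, int]:
    short_target = random.randint(*short_range)
    scale = short_target / min(h, w)
    # Shrink further if the longer side would exceed the cap.
    scale = min(scale, max_long / max(h, w))
    return round(h * scale), round(w * scale)

print(sample_train_size(480, 640))  # e.g. (652, 869) for a 480x640 image
```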
“…As can be seen, ELSA-Swin-T and ELSA-Swin-S (denoted as ELSA-T / ELSA-S) respectively improve the corresponding baselines by 1.9 AP and 1.8 AP in detection, both outperforming other methods within their group. Note that, unlike ViL [87] and RegionViT [6], ELSA-Swin does not modify the macro…”
Section: Object Detection on COCO (mentioning)
confidence: 99%