2022
DOI: 10.48550/arxiv.2204.01697
Preprint
MaxViT: Multi-Axis Vision Transformer

Abstract: Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions …
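A minimal sketch of the two token partitions behind the multi-axis attention described in the abstract: blocked local attention groups tokens into non-overlapping windows, while dilated global (grid) attention groups tokens sampled at a fixed stride across the whole map. The function names and NumPy layout below are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def block_partition(x, b):
    """Blocked local attention: split an (H, W, C) map into non-overlapping
    b x b windows; attention is then computed inside each window."""
    H, W, C = x.shape
    x = x.reshape(H // b, b, W // b, b, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, b * b, C)

def grid_partition(x, g):
    """Dilated global attention: form a fixed g x g grid whose tokens are
    H//g (and W//g) apart, so each attention group spans the full image."""
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)

# Toy check: both partitions tile a 16x16 map into groups of 64 tokens,
# local contiguous windows for blocks, strided global groups for the grid.
x = np.random.rand(16, 16, 8)
assert block_partition(x, 8).shape == (4, 64, 8)
assert grid_partition(x, 8).shape == (4, 64, 8)
```

Because both partitions reshape to the same (groups, tokens, channels) layout, the same attention routine can serve both axes; only the grouping differs.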

Cited by 26 publications (27 citation statements)
References 61 publications
“…In computer vision, non-local neural networks [33] also show that adding a self-attention layer after convolution layers enables the model to capture more global information and improves performance on various vision tasks. Recently, a series of vision Transformer variants that apply convolution and self-attention sequentially have also been proposed, including CvT [34], CoAtNet [35], ViTAEv2 [36], and MaxViT [37]. In speech, Gulati et al. [28] introduce Conformer models for ASR and show that adding a convolution block after the self-attention block achieves the best performance compared to applying it before or in parallel with the self-attention.…”
Section: Sequentially
confidence: 99%
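A sketch of the sequential composition this excerpt describes: a self-attention block followed by a convolution block, the ordering the Conformer result favors. The module sizes, kernel width, and normalization placement here are assumptions for illustration, not any cited model's exact recipe.

```python
import torch
import torch.nn as nn

class AttnThenConv(nn.Module):
    """Self-attention first (global mixing), then a convolution block
    (local mixing), each wrapped in a residual connection."""
    def __init__(self, dim=64, heads=4, kernel=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)

    def forward(self, x):            # x: (batch, seq, dim)
        a, _ = self.attn(x, x, x)    # global mixing first
        x = self.norm(x + a)         # residual + norm
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv after attention
        return x + c                 # residual around the conv block

# Toy usage on a random sequence.
y = AttnThenConv()(torch.randn(2, 100, 64))
print(y.shape)  # torch.Size([2, 100, 64])
```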
“…Transformer [76] is the de-facto standard architecture in natural language processing. Recently, it has been applied to vision problems by viewing pixels or image patches as tokens [6,16], achieving remarkable performance gains in various computer vision tasks, including image classification [16,36,50,73], object detection [75,49,83], semantic segmentation [82,15,65], etc.…”
Section: Vision Transformer
confidence: 99%
“…CNNs [43][44][45][46][47] are the de-facto model for vision tasks due to their outstanding ability to model local dependency [47][48][49] as well as to extract high-frequency components [19]. With these advantages, CNNs have been rapidly introduced into Transformers in a serial or parallel manner [23][24][25][26][50][51][52]. For serial methods, convolutions are applied at different positions of the Transformer.…”
Section: Related Work
confidence: 99%
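For contrast with the serial block above, a sketch of the "parallel manner" this excerpt mentions: a convolution branch and a self-attention branch run on the same input and their outputs are merged. The layer choices and the additive merge are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ParallelConvAttn(nn.Module):
    """Parallel composition: local (conv) and global (attention) branches
    computed side by side on the same input, summed residually."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, 3, padding=1)

    def forward(self, x):                     # x: (batch, seq, dim)
        a, _ = self.attn(x, x, x)             # global branch
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local branch
        return x + a + c                      # merge both branches residually
```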