2021
DOI: 10.48550/arxiv.2110.02178
Preprint

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Abstract: Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-we…

Cited by 116 publications (177 citation statements)
References 20 publications
“…The Cross-ViT (Chen et al., 2021a) uses a dual-branch transformer to combine image patches of different sizes to produce stronger image features, and proposes a cross-attention module to reduce computation. The MobileViT (Mehta & Rastegari, 2021) combines the strengths of CNNs and ViTs by replacing local processing in convolutions with global processing using transformers.…”
Section: Lightweight ViTs (mentioning)
confidence: 99%
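To make this mechanism concrete, below is a minimal, hypothetical PyTorch sketch of a MobileViT-style block. It is not the authors' implementation; the module names, channel sizes, depth, and patch size are illustrative assumptions. Convolutions handle local processing, a transformer encoder applied over unfolded patches handles global processing, and a final convolution fuses the two.

```python
# Hypothetical, simplified MobileViT-style block (illustrative only, not the paper's code).
import torch
import torch.nn as nn

class MobileViTStyleBlock(nn.Module):
    def __init__(self, channels=64, dim=96, patch=2, depth=2, heads=4):
        super().__init__()
        # Local (convolutional) representation
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, dim, 1),
        )
        # Global (self-attention) representation over unfolded patches
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True)
        self.global_rep = nn.TransformerEncoder(layer, depth)
        self.proj = nn.Conv2d(dim, channels, 1)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.patch = patch

    def forward(self, x):
        B, C, H, W = x.shape
        p = self.patch
        y = self.local(x)                          # (B, dim, H, W)
        d = y.shape[1]
        # Unfold into non-overlapping p x p patches: pixels sharing the same
        # intra-patch position form one sequence, so attention spans the whole map.
        y = y.reshape(B, d, H // p, p, W // p, p)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(B * p * p, (H // p) * (W // p), d)
        y = self.global_rep(y)
        # Fold back to the spatial grid
        y = y.reshape(B, p, p, H // p, W // p, d).permute(0, 5, 3, 1, 4, 2)
        y = y.reshape(B, d, H, W)
        y = self.proj(y)
        # Fuse local input and globally processed features
        return self.fuse(torch.cat([x, y], dim=1))

block = MobileViTStyleBlock()
out = block(torch.randn(1, 64, 32, 32))            # H and W must be divisible by the patch size
print(out.shape)                                   # torch.Size([1, 64, 32, 32])
```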
“…More details of the efficiency evaluation will be discussed in Section 6.3. Table excerpt (model: top-1 accuracy, #params, × 32):
…: 72.2, 5.7M, × 32
PiT (Heo et al., 2021): 73.0, 4.9M, × 32
Cross-ViT (Chen et al., 2021a): 73.4, 6.9M, × 32
MobileViT (Mehta & Rastegari, 2021): 74.8, 2.3M, × 32…”
Section: Comparison With Other Lightweight ViTs (mentioning)
confidence: 99%
“…Recently, the pioneering work ViT [22] successfully applies a pure transformer-based architecture to computer vision, revealing the potential of transformers in handling visual tasks. Many follow-up studies have been proposed [4,5,9,12,18,21,23,24,27-29,31,38,41,43,45,50,52,56,76,77,80,81,84]. Many of them analyze ViT [15,17,26,32,44,55,69,73,75,82] and improve it by introducing locality to earlier layers [11,17,48,64,79,83,87].…”
Section: Related Work (mentioning)
confidence: 99%
“…Additionally, the vanilla ViT models, and especially their larger variants, are very hard to train and need to be trained on huge annotated datasets. Following ViT, many methods [16,14,9,18,19] appeared that try to solve or circumvent these issues while maintaining SOTA performance [4,12]. In this work, we incorporate two very promising architectures, DEIT [12] and Xcit [4], in our models and evaluate their performance.…”
Section: Vision Transformers (mentioning)
confidence: 99%