2022
DOI: 10.48550/arxiv.2209.15159
Preprint

MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features

Abstract: MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside the MobileViTv1-block creates scaling challenges and poses a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling challenges and simplifies the learning task. Our prop…

Cited by 14 publications (19 citation statements)
References 26 publications (47 reference statements)
“…Following this, we bring the channel dimension back to its original size using a 1 × 1 convolutional layer and then concatenate it with the output feature of the local representation block. This strategy, distinct from MobileViT's concatenation of input and global representation features, has proven effective in MobileViTv3 (Wadekar and Chaurasia 2022). This effectiveness is attributed to the stronger correlation between local and global representation features, and the fact that local representation features possess a higher channel count compared to input features, allowing more channel information to be integrated into the fusion block.…”
Section: Swin MobileViT Block
confidence: 99%
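The fusion described in this excerpt reduces to a few tensor operations. Below is a minimal PyTorch sketch, assuming the MobileViTv3-style design the excerpt refers to: a 1 × 1 convolution restores the global branch to the local branch's channel width, the two are concatenated and fused by another 1 × 1 convolution, and the block input is added back residually (the residual add follows MobileViTv3's description, not this citing paper's text). The FusionSketch name and all channel sizes are illustrative, not code from either paper.

    import torch
    import torch.nn as nn

    class FusionSketch(nn.Module):
        def __init__(self, local_ch, global_ch, in_ch):
            super().__init__()
            # 1x1 conv brings the global branch back to the local channel width
            self.restore = nn.Conv2d(global_ch, local_ch, kernel_size=1)
            # 1x1 conv fuses the concatenated local + global features
            self.fuse = nn.Conv2d(2 * local_ch, in_ch, kernel_size=1)

        def forward(self, input_feat, local_feat, global_feat):
            g = self.restore(global_feat)
            fused = self.fuse(torch.cat([local_feat, g], dim=1))
            return fused + input_feat  # residual input add, per MobileViTv3

    out = FusionSketch(96, 144, 64)(
        torch.randn(1, 64, 32, 32),    # block input features
        torch.randn(1, 96, 32, 32),    # local representation output
        torch.randn(1, 144, 32, 32))   # global representation output

Fusing local with global features (rather than input with global, as in MobileViTv1) is what the excerpt credits for the gain: the two branches are more strongly correlated, and the local branch carries more channels than the raw input.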
“…The YOLOv8s architecture, a well-established detector, consists of three integral components: a backbone network, neck, and head. To optimally configure this detector for infrared imagery, we engineered an augmented version, which entails incorporating a specialized infrared feature extraction module and supplanting the standard backbone with the advanced MobileViTv3 [31] network. This dedicated infrared feature extraction module is adept at discerning salient features unique to infrared imagery, thereby enriching the feature representation.…”
Section: Detectors
confidence: 99%
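The augmentation described above is essentially a wiring change: an infrared stem in front of a MobileViTv3 backbone, feeding the usual detection head. Below is a minimal runnable PyTorch sketch of that wiring only; every module here is a toy stand-in (the real backbone mixes convolutions and attention, and a real YOLO head predicts multi-scale boxes through a neck), and none of it is the authors' code.

    import torch
    import torch.nn as nn

    class IRDetectorSketch(nn.Module):
        def __init__(self, num_classes=1):
            super().__init__()
            # hypothetical stem that learns low-level infrared-specific features
            self.ir_stem = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.SiLU())
            # stand-in for the MobileViTv3 backbone that replaces the default one
            self.backbone = nn.Sequential(
                nn.Conv2d(16, 64, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU())
            # stand-in head: box (4) + objectness (1) + class scores per location
            self.head = nn.Conv2d(128, 5 + num_classes, kernel_size=1)

        def forward(self, x):  # x: (B, 1, H, W) single-channel infrared image
            return self.head(self.backbone(self.ir_stem(x)))

    preds = IRDetectorSketch()(torch.randn(1, 1, 256, 256))  # (1, 6, 32, 32)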
“…As shown in Figure 4, MobileViTv3 [31] is the third-generation MobileViT model, a lightweight vision model for mobile devices that combines CNNs and vision transformers (ViTs). With a basic training methodology, its roughly 6-million-parameter variant exceeds the classification accuracy of models with around 8 million parameters and outperforms the vast majority of mainstream models. Compared to Darknet-53, the backbone network of YOLOv8s, MobileViTv3 features a more lightweight structure that achieves better performance while reducing both model size and computational demands.…”
Section: MobileViTv3 Backbone Network
confidence: 99%
“…Subsequently, MobileViTv2 [39] proposed a self-attention (SA) method with linear time complexity, greatly reducing the model's resource consumption. MobileViTv3 [40] improved the model's generalisation by integrating multi-scale contextual features. Hybrid models make full use of the local inductive bias of CNNs and the global receptive field of the Transformer structure.…”
Section: Efficient Model
confidence: 99%
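The linear-complexity self-attention this excerpt credits to MobileViTv2 is that paper's separable self-attention: per-token context scores from a projection to a single scalar replace the k × k attention matrix, so cost grows as O(k) in the token count k rather than O(k²). Below is a sketch written from that description, assuming PyTorch; layer names and dimensions are our choices, not the library's API.

    import torch
    import torch.nn as nn

    class SeparableSelfAttention(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.to_scores = nn.Linear(dim, 1)    # projects each token to a scalar
            self.to_key = nn.Linear(dim, dim)
            self.to_value = nn.Linear(dim, dim)
            self.out = nn.Linear(dim, dim)

        def forward(self, x):  # x: (B, k, d) tokens
            scores = self.to_scores(x).softmax(dim=1)                     # (B, k, 1)
            context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)  # (B, 1, d)
            v = torch.relu(self.to_value(x))                              # (B, k, d)
            return self.out(context * v)  # context broadcast over tokens: (B, k, d)

    y = SeparableSelfAttention(64)(torch.randn(2, 196, 64))  # (2, 196, 64)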