EfficientFormer: Vision Transformers at MobileNet Speed

Li, Yanyu; Geng, Yong; Ye, Wen; Hu, Eric; Evangelidis, Georgios; Tulyakov, Sergey; Wang, Yanzhi; Ren, Jing

doi:10.48550/arxiv.2206.01191

Cited by 19 publications

(28 citation statements)

References 51 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Hybrid Models. Recent works [7,17,23,29,35] have shown that combining convolution and Transformer as a hybrid architecture helps absorb the strengths of both architectures. BoTNet [29] replaces the spatial convolutions with global self-attention in the final three bottleneck blocks of ResNet.…”

Section: Related Workmentioning

confidence: 99%

“…Mobile-Former [2] combines with the proposed lightweight cross attention to model the bridge, which is not only computationally efficient, but also has more representation power. EfficientFormer [17] complies with a dimension consistent design that smoothly leverages hardware-friendly 4D MetaBlocks and powerful 3D MHSA blocks. In this paper, we design a family of Next-ViT models that adapt more to the realistic industrial scenarios.…”

Section: Related Workmentioning

confidence: 99%

“…Some recent works [4,17,23,36] have paid great efforts to combine CNN and Transformer for efficient deployment. As shown in Figure 4(b)(c), almost all of them monotonously adopt convolution blocks in the shallow stages and just stack Transformer blocks in the last one or two stages which presents effective results in the classification task.…”

Section: Next Hybrid Strategy (Nhs)mentioning

confidence: 99%

“…For example, Swin Transformer [19] and PVT [34] try to design more efficient spatial attention mechanisms to alleviate the quadratic-increasing computation complexity of MHSA. The others [4,17,23] consider combining efficient convolution blocks and powerful Transformer blocks to design CNN-Transformer hybrid architecture to obtain a better trade-off between accuracy and latency. Coincidentally, almost all existing hybrid architectures [4,17,23] adopt convolution blocks in the shallow stages and just stack Transformer block in the last few stages.…”

Section: Introductionmentioning

confidence: 99%

“…The others [4,17,23] consider combining efficient convolution blocks and powerful Transformer blocks to design CNN-Transformer hybrid architecture to obtain a better trade-off between accuracy and latency. Coincidentally, almost all existing hybrid architectures [4,17,23] adopt convolution blocks in the shallow stages and just stack Transformer block in the last few stages. However, we observe that such a hybrid strategy is effortless to lead to performance saturation on downstream tasks (e.g.…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Li¹,

Xia²,

Li³

et al. 2022

Preprint

View full text Add to dashboard Cite

Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) can not perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g. TensorRT and CoreML. This poses a distinct challenge: Can a visual neural network be designed to infer as fast as CNNs and perform as powerful as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet the overall performance of these works is far away from satisfactory. To end these, we propose a next generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are respectively developed to capture local and global information with deployment-friendly mechanisms. Then, Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.4 mAP (from 40.4 to 45.8) on COCO detection and 8.2% mIoU (from 38.8% to 47.0%) on ADE20K segmentation under similar latency. Meanwhile, it achieves comparable performance with CSWin, while the inference speed is accelerated by 3.6×. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.2% to 48.7%) on ADE20K segmentation under similar latency. Code will be released recently.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Section: Next Hybrid Strategy (Nhs)mentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Li¹,

Xia²,

Li³

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

Depthwise Convolution with Channel Mixer: Rethinking MLP in MetaFormer for Faster and More Accurate Vehicle Detection

Lu,

Kang,

Huang

2023

Lecture Notes in Computer Science

View full text Add to dashboard Cite

CTformer: convolution-free Token2Token dilated vision transformer for low-dose CT denoising

Wang

Fan

et al. 2023

Phys. Med. Biol.

View full text Add to dashboard Cite

Low-dose computed tomography (LDCT) denoising is an important problem in CT research. Compared to the normal dose CT (NDCT), LDCT images are subjected to severe noise and artifacts. Recently in many studies, vision transformers have shown superior feature representation ability over convolutional neural networks (CNNs). However, unlike CNNs, the potential of vision transformers in LDCT denoising was little explored so far. To fill this gap, we propose a Convolution-free Token2Token Dilated Vision Transformer (CTformer) for low-dose CT denoising. The CTformer uses a more powerful token rearrangement to encompass local contextual information and thus avoids convolution. It also dilates and shifts feature maps to capture longer-range interaction. We interpret the CTformer by statically inspecting patterns of its internal attention maps and dynamically tracing the hierarchical attention flow with an explanatory graph. Furthermore, an overlapped inference mechanism is introduced to effectively eliminate the boundary artifacts that are common for encoder-decoder-based denoising models. Experimental results on MayoLDCT dataset suggest that the CTformeroutperforms the state-of-the-art denoising methods with a low computation overhead.

show abstract

EfficientFormer: Vision Transformers at MobileNet Speed

Cited by 19 publications

References 51 publications

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios

Depthwise Convolution with Channel Mixer: Rethinking MLP in MetaFormer for Faster and More Accurate Vehicle Detection

CTformer: convolution-free Token2Token dilated vision transformer for low-dose CT denoising

Contact Info

Product

Resources

About