CMT: Convolutional Neural Networks Meet Vision Transformers

Guo, Jianyuan; Han, Kai; Wu, Han; Tang, Yehui; Chen, Xinghao; Wang, Yunhe; Xu, Chang

doi:10.1109/cvpr52688.2022.01186

Cited by 366 publications

(162 citation statements)

References 35 publications

Supporting

Mentioning

106

Contrasting

“…Some recent solutions try to use the advantages of CNN and Transformer by integrating the two architectures as a new backbone network. The CMT (Guo et al, 2022) block consists of a depthwise convolution-based local perception unit and a light-weight transformer module. CoAtNet (Dai et al, 2021) fuses the two frameworks based on MBConv and relative self-attention.…”

Section: Related Work (mentioning)

confidence: 99%

A medical image segmentation method based on multi-dimensional statistical features

et al. 2022

Front. Neurosci.

Medical image segmentation is an important aid to clinical diagnosis and treatment. Most existing medical image segmentation solutions adopt convolutional neural networks (CNNs). Although these solutions can achieve good segmentation performance, CNNs focus on local information and ignore global image information. Because a Transformer can encode the whole image, it has good global modeling ability and is effective at extracting global information. Therefore, this paper proposes a hybrid feature extraction network that integrates CNNs and a Transformer to exploit the feature extraction advantages of both. To enhance low-dimensional texture features, this paper also proposes a multi-dimensional statistical feature extraction module that fully fuses the features extracted by the CNNs and the Transformer and improves the segmentation performance on medical images. The experimental results confirm that the proposed method achieves better results in brain tumor segmentation and ventricle segmentation than state-of-the-art solutions.

“…Different from SVT‐Net [27], the lightweight multi‐headed self‐attention (LMHSA) is employed to structure the transformer. The architecture of LMHSA is shown in Figure 3, which is modified from CMT [37].…”

Section: Methods (mentioning)

confidence: 99%

Sequence matching enhanced 3D place recognition using candidate rearrangement

Yan, Zhuang 2022

IET Cyber-Syst and Robotics

Deep-learning-based 3D place recognition has received more attention as data-driven approaches have become widely used for 3D point cloud applications. Most existing deep-learning-based 3D place recognition methods utilise only a single scene for place recognition. However, a single scene may contain measurement noise or differences in observable dynamic objects, which may reduce recognition accuracy. To improve the performance of 3D place recognition, a sequence-matching-based rearrangement method is proposed. Our sequence matching method is based on an assignment algorithm and guides the candidate rearrangement in searching for a similar place. The global descriptor extraction adopts an effective sparse tensor representation and a simple pooling layer to obtain the global descriptor. A new loss function combination is employed to train the network. The proposed approach is evaluated on popular 3D place recognition benchmarks, which demonstrates its effectiveness.

“…MobileFormer architecture combines MobileNetv3 and ViT to also achieve competitive results. CMT (Guo et al, 2022) architecture has convolutional stem, convolutional layer before every transformer block and stacks convolutional layers and transformer layers alternatively. CvT (Wu et al, 2021) uses convolutional token embedding instead of linear embedding used in ViTs and a convolutional transformer layer block that leverages these convolutional token embeddings to improve performance.…”

Section: Related Work (mentioning)

confidence: 99%

“…PVTv2-B1 achieves 78.7% with ∼2.3x more parameters, similar FLOPs and advanced data augmentation. CMT-Ti (Guo et al, 2022) achieves 79.1% with ∼1.6x more parameters, ∼2.9x less FLOPs (due to input image size of 160x160) and advanced data augmentation.…”

Section: Models Greater Than 8 Million Parameters (mentioning)

confidence: 99%

“…Many recent works have introduced convolutional layers in ViT architecture to form hybrid networks to improve performance, achieve sample efficiency and make the models more efficient in terms of parameters and FLOPs like MobileViTs (MobileViTv1 (Mehta & Rastegari, 2021), MobileViTv2 (Mehta & Rastegari, 2022)), CMT (Guo et al, 2022), CvT (Wu et al, 2021), PVTv2, ResT, MobileFormer, CPVT (Chu et al, 2021), MiniViT, CoAtNet, CoaT (Xu et al, 2021a). Performance of many of these models on ImageNet-1K, with parameters and FLOPs is shown in Figure 1.…”

Section: Introduction (mentioning)

confidence: 99%

MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features

Wadekar, Chaurasia 2022

Preprint

MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps to achieve competitive state-of-the-art results, the fusion block inside the MobileViTv1-block creates scaling challenges and has a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling challenges and simplifies the learning task. Our proposed MobileViTv3-block, used to create the MobileViTv3-XXS, XS and S models, outperforms MobileViTv1 on the ImageNet-1K, ADE20K, COCO and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9% respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create the MobileViTv3-0.5, 0.75 and 1.0 models. These new models give better accuracy on the ImageNet-1K, ADE20K, COCO and PascalVOC2012 datasets than MobileViTv2. MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0% respectively on the ImageNet-1K dataset. For the segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIOU than MobileViTv2-1.0 on the ADE20K and PascalVOC2012 datasets respectively. Our code and the trained models are available at https://github.com/micronDLA/MobileViTv3.
