2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01186
CMT: Convolutional Neural Networks Meet Vision Transformers

Cited by 366 publications (162 citation statements)
References 35 publications
“…Some recent solutions combine the advantages of CNNs and Transformers by integrating the two architectures into a new backbone network. The CMT (Guo et al., 2022) block consists of a depthwise convolution-based local perception unit and a lightweight transformer module. CoAtNet (Dai et al., 2021) fuses the two frameworks based on MBConv and relative self-attention.…”
Section: Related Work
confidence: 99%
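The "depthwise convolution-based local perception unit" quoted above can be sketched minimally: a per-channel 3x3 convolution whose output is added back to the input, injecting local spatial structure before attention. This is an illustrative NumPy sketch, not the paper's implementation; the kernel shapes and the naive loop are simplifying assumptions.

```python
import numpy as np

def depthwise_conv3x3(x, w):
    """Depthwise 3x3 convolution, stride 1, zero padding 1.
    x: (C, H, W); w: (C, 3, 3) -- one independent 3x3 kernel per channel."""
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + 3, j:j + 3] * w[c])
    return out

def local_perception_unit(x, w):
    """LPU(x) = x + DWConv(x): a residual depthwise conv that adds
    local detail to the feature map before the attention block."""
    return x + depthwise_conv3x3(x, w)

# Toy check: with an identity kernel (center weight 1), LPU(x) = 2x.
C, H, W = 2, 4, 4
w = np.zeros((C, 3, 3))
w[:, 1, 1] = 1.0
x = np.random.rand(C, H, W)
y = local_perception_unit(x, w)
assert np.allclose(y, 2 * x)
```

In a real network the depthwise kernels are learned; the identity-kernel check above only verifies the residual wiring.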
“…Different from SVT-Net [27], the lightweight multi-headed self-attention (LMHSA) is employed to build the transformer. The architecture of LMHSA, shown in Figure 3, is modified from CMT [37].…”
Section: Methods
confidence: 99%
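The "lightweight" aspect of LMHSA can be illustrated with a single-head sketch: the key/value sequence is spatially downsampled (here by average pooling with stride k, a simplifying assumption) before attention, shrinking the attention matrix from (H·W, H·W) to (H·W, H·W/k²). Projection matrices and multiple heads are omitted for brevity.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def lightweight_attention(x, k=2):
    """Single-head sketch of lightweight self-attention.
    x: (H, W, d). Keys/values come from an average-pooled copy of x,
    so the attention map has H*W rows but only H*W/k^2 columns."""
    H, W, d = x.shape
    q = x.reshape(H * W, d)                               # full-resolution queries
    xr = x.reshape(H // k, k, W // k, k, d).mean(axis=(1, 3))  # pool by k
    kv = xr.reshape((H // k) * (W // k), d)               # reduced keys/values
    attn = softmax(q @ kv.T / np.sqrt(d))                 # (H*W, H*W/k^2)
    return (attn @ kv).reshape(H, W, d)

x = np.random.rand(4, 4, 8)
y = lightweight_attention(x, k=2)
assert y.shape == x.shape
```

Because attention cost is linear in the product of the two sequence lengths, pooling keys/values by k cuts the dominant cost by roughly k².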
“…The MobileFormer architecture combines MobileNetV3 and ViT, also achieving competitive results. The CMT (Guo et al., 2022) architecture has a convolutional stem, a convolutional layer before every transformer block, and stacks convolutional and transformer layers alternately. CvT (Wu et al., 2021) uses convolutional token embedding instead of the linear embedding used in ViTs, and a convolutional transformer block that leverages these token embeddings to improve performance.…”
Section: Related Work
confidence: 99%
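The "convolutional token embedding" mentioned for CvT can be sketched as a strided k×k convolution that maps an image to a grid of tokens, allowing overlapping patches, unlike ViT's non-overlapping linear patch embedding. This is a hypothetical NumPy sketch ('valid' padding, no bias) rather than the CvT implementation.

```python
import numpy as np

def conv_token_embedding(x, w, stride=2):
    """Strided convolution as token embedding.
    x: (H, W, C_in) image; w: (k, k, C_in, d) conv kernel.
    Returns a (num_tokens, d) sequence for a transformer.
    With stride < k, patches overlap, unlike ViT patch embedding."""
    H, W, C = x.shape
    k, d = w.shape[0], w.shape[-1]
    Ho = (H - k) // stride + 1
    Wo = (W - k) // stride + 1
    tokens = np.zeros((Ho, Wo, d))
    for i in range(Ho):
        for j in range(Wo):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            # contract patch (k, k, C) against kernel (k, k, C, d) -> (d,)
            tokens[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return tokens.reshape(Ho * Wo, d)

x = np.random.rand(8, 8, 3)       # toy 8x8 RGB image
w = np.random.rand(3, 3, 3, 16)   # 3x3 kernel, 16-dim tokens
t = conv_token_embedding(x, w, stride=2)
assert t.shape == (9, 16)         # 3x3 grid of overlapping 3x3 patches
```

Setting stride equal to the kernel size would recover ViT-style non-overlapping patches, so this one function covers both embedding styles.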
“…PVTv2-B1 achieves 78.7% with ∼2.3x more parameters, similar FLOPs, and advanced data augmentation. CMT-Ti (Guo et al., 2022) achieves 79.1% with ∼1.6x more parameters, ∼2.9x fewer FLOPs (due to an input image size of 160x160), and advanced data augmentation.…”
Section: Models Greater Than 8 Million Parameters
confidence: 99%