2021
DOI: 10.48550/arxiv.2104.13497
Preprint

ConTNet: Why not use convolution and transformer at the same time?

Abstract: Although convolutional networks (ConvNets) have enjoyed great success in computer vision (CV), they struggle to capture the global information crucial to dense prediction tasks such as object detection and segmentation. In this work, we propose ConTNet (Convolution-Transformer Network), which combines transformers with ConvNet architectures to provide large receptive fields. Unlike the recently proposed transformer-based models (e.g., ViT, DeiT), which are sensitive to hyper-parameters and extremely depende…
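
As a rough illustration of the idea the abstract describes, the sketch below pairs a plain convolution (local features) with a standard transformer encoder layer applied over the flattened feature map (global receptive field). This is a minimal PyTorch sketch, not the paper's actual ConT block: the layer sizes, normalization choices, and token layout here are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ConvTransformerBlock(nn.Module):
    """Toy conv + transformer hybrid block.

    A 3x3 convolution contributes the local inductive bias of a
    ConvNet; a standard transformer encoder layer over the flattened
    spatial positions contributes a global receptive field. All
    hyper-parameters are illustrative assumptions, not the paper's
    actual ConT block configuration.
    """

    def __init__(self, channels: int = 32, num_heads: int = 4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels,
            nhead=num_heads,
            dim_feedforward=4 * channels,
            batch_first=True,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)                       # local features, (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        tokens = self.encoder(tokens)          # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)


block = ConvTransformerBlock()
out = block(torch.randn(2, 32, 14, 14))
print(out.shape)  # torch.Size([2, 32, 14, 14]) -- spatial shape preserved
```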


Cited by 25 publications (27 citation statements). References 54 publications (98 reference statements).

“…DeiT [38] improves the data efficiency of training ViT with a token distillation pipeline. Apart from the sequence-to-sequence structure, the efficiency of PVT [39] and Swin Transformer [30] sparks much interest in exploring the Hierarchical Vision Transformer (HVT) [14,22,41,44]. ViT has also been extended to solve low-level tasks and dense prediction problems [2,6,20].…”
Section: Vision Transformer
confidence: 99%
“…The recent advance [12] shows that the transformer can also achieve incredible performance on computer vision tasks. While the vision transformer suffers from the need for large-scale datasets [44], many recent works try to encode a strong inductive prior, either by combining it with convolutional layers [53,31,57,55] or by introducing a 2D hierarchical structure into the vision transformer [33,49,13,9]. Besides, the transformer also shows strong power in other vision tasks, including semantic segmentation [60], object detection [5,61], image processing [10], and image generation [27,26].…”
Section: Related Work
confidence: 99%
“…Moreover, the non-linear embedding method (SNE) also improves the classification rates on the ImageNet dataset, compared to the baseline model, which uses the linear embedding method.

Model                    Params  Type  Top-1 (%)
ViT-Ti [8]               5.7M    T     68.7
T2T-ViT-7 [37]           4.3M    T     71.7
DeiT-Ti [29]             5.7M    T     72.2
Mobile-Former-96M [3]    4.6M    C+T   72.8
LocalViT-T2T [18]        4.3M    C+T   72.5
PiT-Ti [11]              4.9M    C+T   73.0
ConViT-Ti [5]            5.7M    C+T   73.1
ConT-Ti [34]             5.8M    C+T   74.9
LocalViT-T [18]          5.9M    C+T   74.8
ViTAE-T [33]             4.8M    C+T   75.3
EfficientNet-B0 [28]     5.3M    C     76.3
CeiT-T [36]              6.4M    C+T   76.4
T2T-ViT-12 [37]          6.9M    T     76.5
ConT-S [34]              10.1M   C+T   76.5
CoaT-Lite Tiny [32]      5.7M    C+T   76.6
Swin-1G [3,19]           7.3M    T     77.3
XCiT-T12 (baseline) [9]  6.7M    T     77.…”

(Type: C = convolution, T = transformer, C+T = hybrid.)
Section: ImageNet Classification
confidence: 99%
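
The quoted snippet contrasts a non-linear patch embedding (SNE) with a linear baseline. As a rough sketch of what that distinction means, the code below shows a ViT-style linear patch embedding next to a generic non-linear alternative built from stacked strided convolutions with normalization and activation in between; the non-linear variant is only an assumption for illustration, not the SNE method the quoted paper evaluates.

```python
import torch
import torch.nn as nn


class LinearPatchEmbed(nn.Module):
    """ViT-style linear patch embedding: a single strided convolution."""

    def __init__(self, in_ch: int = 3, dim: int = 192, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)


class NonLinearPatchEmbed(nn.Module):
    """Generic non-linear alternative: two strided convs with a
    normalization and activation between them (an illustrative
    assumption, not the SNE design)."""

    def __init__(self, in_ch: int = 3, dim: int = 192):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, dim // 2, kernel_size=4, stride=4),
            nn.BatchNorm2d(dim // 2),
            nn.GELU(),
            nn.Conv2d(dim // 2, dim, kernel_size=4, stride=4),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)


# Both embeddings map a 224x224 image to the same 14x14 = 196 tokens.
x = torch.randn(1, 3, 224, 224)
print(LinearPatchEmbed()(x).shape)     # torch.Size([1, 196, 192])
print(NonLinearPatchEmbed()(x).shape)  # torch.Size([1, 196, 192])
```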