2022
DOI: 10.48550/arxiv.2201.09450
Preprint
UniFormer: Unifying Convolution and Self-attention for Visual Recognition

Cited by 32 publications (54 citation statements); references 0 publications.
“…Recently, the Transformer [50] has attracted the attention of the computer vision community due to its success in natural language processing. A series of Transformer-based methods [13,27,56,51,36,18,12,6,57,60,25,42] have been developed for high-level vision tasks, including image classification [36,13,27,44,49], object detection [34,48,36,4,6], and segmentation [55,51,16,2]. Although the vision Transformer has shown its superiority in modeling long-range dependencies [13,43], many works demonstrate that convolution can help the Transformer achieve better visual representations [56,58,61,60,25].…”
Section: Vision Transformer
confidence: 99%
“…A series of Transformer-based methods [13,27,56,51,36,18,12,6,57,60,25,42] have been developed for high-level vision tasks, including image classification [36,13,27,44,49], object detection [34,48,36,4,6], and segmentation [55,51,16,2]. Although the vision Transformer has shown its superiority in modeling long-range dependencies [13,43], many works demonstrate that convolution can help the Transformer achieve better visual representations [56,58,61,60,25]. Owing to this impressive performance, the Transformer has also been introduced for low-level vision tasks [5,54,37,29,3,62,28,26].…”
Section: Vision Transformer
confidence: 99%
“…Recently, SETR [27] and Segmenter [58] directly adopt vision transformers [22], [23] as the backbone, capturing global context from very early layers. SegFormer [59], PVT [60], Swin [24], and UniFormer [61] create hierarchical structures to make use of multi-resolution features. Leveraging the advances of DETR [62], MaX-DeepLab [63] and MaskFormer [64] view image segmentation from the perspective of mask classification.…”
Section: Transformer-driven Semantic Segmentation
confidence: 99%