2020
DOI: 10.48550/arxiv.2006.03677
Preprint
Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Abstract: Computer vision has achieved great success using standardized image representations (pixel arrays) and the corresponding deep learning operators (convolutions). In this work, we challenge this paradigm: we instead (a) represent images as a set of visual tokens and (b) apply visual transformers to find relationships between visual semantic concepts. Given an input image, we dynamically extract a set of visual tokens from the image to obtain a compact representation for high-level semantics. We then use visual tr…

Cited by 155 publications (171 citation statements)
References 39 publications
“…Transformer was first proposed by [63] for machine translation. Recently, Transformer has achieved great success in high-level vision, such as image classification [1,16,17,44,69], semantic segmentation [7,44,69,80], human pose estimation [5,6,39,41,46,70], and object detection [9,14,30,44,82]. Owing to its ability to capture long-range dependencies and its excellent performance in many high-level vision tasks, Transformer has also been introduced into low-level vision [8,10,42,67].…”
Section: Vision Transformer
confidence: 99%
“…As a special case, the patches can have spatial size 1x1, which means that the input sequence is obtained by simply […] mechanism, which grows quadratically with the feature resolution. Similar to previous attention-based methods [14], [22], [95], some methods attempt to insert Transformer into CNN backbones or replace part of the convolution blocks with Transformer layers [43], [44].…”
Section: Vision Transformer (ViT)
confidence: 99%
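The quadratic growth mentioned in the statement above can be made concrete: flattening an H×W feature map into N = HW tokens means self-attention materializes an N×N weight matrix, so doubling the spatial resolution quadruples N and grows that matrix 16×. A minimal sketch, using random stand-in projection matrices (Wq, Wk, Wv are placeholders, not trained weights from any cited model):

```python
import numpy as np

def self_attention(x):
    """Plain single-head self-attention over a flattened feature map.

    x: (N, d) array of N = H*W token vectors. Illustrative sketch only;
    the projection matrices are random stand-ins, not learned parameters.
    """
    n, d = x.shape
    rng = np.random.default_rng(0)
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                 # (N, N): quadratic in H*W
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights

h, w, d = 16, 16, 32                              # 16x16 map -> N = 256 tokens
x = np.random.default_rng(1).standard_normal((h * w, d))
out, attn = self_attention(x)
print(attn.shape)  # (256, 256)
```

At 32×32 the attention matrix would already be 1024×1024, which is why the cited works either shrink the spatial resolution first or replace only part of the CNN backbone with Transformer layers.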
“…VTs: Considering that convolution treats all pixels equally regardless of their importance, the Visual Transformer (VT) [43] decouples semantic concepts of the input image into different channels and relates them densely through Transformer encoder blocks. In detail, a VT-block consists of three parts.…”
Section: Vision Transformer (ViT)
confidence: 99%
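The tokenization step that the statement above describes — grouping pixels into a small set of semantic tokens — can be sketched as the filter-based tokenizer from the VT paper: a spatial softmax assigns each pixel to tokens, and each token is the attention-weighted average of pixel features. This is a minimal sketch; the weight matrix here is a random stand-in for the learned assignment weights:

```python
import numpy as np

def filter_tokenizer(x, w_a):
    """Filter-based tokenizer sketch: compress HW pixels into L visual tokens.

    x:   (HW, C) flattened feature map
    w_a: (C, L) token-assignment weights (learned in the paper; random here)
    Softmax is taken over the spatial dimension, so each token is a
    normalized, attention-weighted pooling of pixel features.
    """
    logits = x @ w_a                                    # (HW, L)
    a = np.exp(logits - logits.max(axis=0, keepdims=True))
    a /= a.sum(axis=0, keepdims=True)                   # softmax over HW
    return a.T @ x                                      # (L, C) visual tokens

rng = np.random.default_rng(0)
hw, c, num_tokens = 14 * 14, 64, 8                      # e.g. a 14x14 map
x = rng.standard_normal((hw, c))
tokens = filter_tokenizer(x, rng.standard_normal((c, num_tokens)))
print(tokens.shape)  # (8, 64)
```

The payoff is the compact representation: downstream Transformer blocks attend over 8 tokens instead of 196 pixels, which is what makes the dense token-to-token relations cheap.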
“…Transformer [90] has achieved great success in Natural Language Processing, and it has been applied to multiple computer vision tasks such as image recognition [20,94] and […] [20] to aggregate the sequential feature maps. Note that there is only an encoder and no decoder in this module, since we only care about the high-level procedure features rather than the specific feature of each step after decoding.…”
Section: Vision Transformer
confidence: 99%