2021
DOI: 10.48550/arxiv.2111.06091
Preprint
A Survey of Visual Transformers

Abstract: Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, some pioneering works have recently adapted Transformer-like architectures to computer vision (CV), demonstrating their effectiveness on various CV tasks. Relying on competitive modeling capability, visual Transformers have achieved impressive performance on multiple benchmarks such as ImageNet, COCO and ADE20k as com…

Cited by 30 publications (29 citation statements)
References 108 publications
“…In this section, we summarize the previous research on breast cancer diagnosis in ultrasound images [10], [11] and the transformer-based medical image classification models [12].…”
Section: Related Work
confidence: 99%
“…Due to the excellent performance of ViT, many Transformer-based image classification models have been proposed, improving ViT from perspectives in five categories [15]…”
Section: ViT-based Image Classification
confidence: 99%
“…These models typically rely on region-based image features extracted by pre-trained object detectors built on commonly used two-stage detectors (typically the Faster R-CNN model [28] or its extension Mask-RCNN [29]), single-stage detectors (typically SSD and YOLO V3 [30]), or anchor-free detectors (e.g., [31]). Another direction is patch embedding [32,33,34,35,36]. This line of work operates directly on patches (as a sequence of tokens with fixed length).…”
Section: Related Work
confidence: 99%
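The patch-embedding direction quoted above splits an image into fixed-size patches and linearly projects each flattened patch into a token, yielding a fixed-length sequence for the Transformer. A minimal NumPy sketch of that idea follows; the function name, patch size, and embedding dimension are illustrative (in a real ViT the projection matrix is learned, not random):

```python
import numpy as np

def patch_embed(image, patch_size=16, embed_dim=64, rng=None):
    """ViT-style patch embedding sketch: split an (H, W, C) image into
    non-overlapping patches and project each flattened patch to embed_dim.
    Returns an array of shape (num_patches, embed_dim)."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    # Rearrange into (num_patches, patch_size * patch_size * C)
    patches = (image.reshape(ph, patch_size, pw, patch_size, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(ph * pw, patch_size * patch_size * C))
    # Shared linear projection (learned in practice; random here)
    W_proj = rng.standard_normal((patches.shape[1], embed_dim))
    return patches @ W_proj

tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 64): a fixed-length token sequence
```

For a 224x224 input with 16x16 patches this produces 14 x 14 = 196 tokens, matching the "sequence of tokens with fixed length" the citation statement describes; detector-based pipelines instead produce a variable number of region features.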