2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021
DOI: 10.1109/cvpr46437.2021.01625

Bottleneck Transformers for Visual Recognition

Citations: cited by 857 publications (406 citation statements)
References: 21 publications
“…Recently, more variant ViT models, e.g., DeiT [220], PVT [221], TNT [222], and Swin [223], have been proposed for the pursuit of stronger performance. There are also plenty of works trying to augment a pure transformer block or self-attention layer with a convolution operation, e.g., BoTNet [224], CeiT [225], CoAtNet [226], CvT [227]. Some works (such as the DETR methods [228][229][230]) try combining CNN-like architectures with transformers for object detection.…”
Section: Vision Transformer (mentioning)
confidence: 99%
“…Alternatively, Contrastive Learning (CL) has gained popularity in the CV community as a variant of SSL for visual representation [5,6,11,14,26]. CL is based on data augmentation of a self and contrastive term, where learning is carried out by maximizing similarities of the representations of the augmented views of the same object and minimizing similarity with respect to the contrastive object.…”
Section: Self-supervised Learning (mentioning)
confidence: 99%
“…The proposed architecture, shown in Fig. 1, mimics a Siamese network [1] that is commonly used in recent contrastive self-supervised models for representation learning [5,6,11,13,23,24,26]. It has two parallel networks, referred to as a student (left hand side) and teacher (right hand side) networks [6,11].…”
Section: Architecture (mentioning)
confidence: 99%
“…This article compares a series of Convolutional Neural Networks (CNNs), such as ResNet-18, 34, 50, 101 (He et al, 2016), VGG11, 13, 16, 19 (Simonyan and Zisserman, 2014), DenseNet-121, 169 (Huang et al, 2017), Inception-V3 (Szegedy et al, 2016), Xception (Chollet, 2017), AlexNet (Krizhevsky et al, 2012), GoogleNet (Szegedy et al, 2015), MobileNet-V2 (Sandler et al, 2018), ShuffleNet-V2x0.5 (Ma et al, 2018), Inception-ResNet-V1 (Szegedy et al, 2017), and a series of visual transformers (VTs), such as vision transformer (ViT) (Dosovitskiy et al, 2020), BotNet (Srinivas et al, 2021), DeiT (Touvron et al, 2020), T2T-ViT (Yuan et al, 2021). The purpose is to find deep learning models that are suitable for EM small datasets.…”
Section: Introduction (mentioning)
confidence: 99%