Bottleneck Transformers for Visual Recognition

Srinivas, Aravind; Lin, Tsung-Yi; Parmar, Niki; Shlens, Jonathon; Abbeel, Pieter; Vaswani, Ashish

doi:10.1109/cvpr46437.2021.01625

Cited by 857 publications

(406 citation statements)

References 21 publications

Mentioning: 296


“…Recently, more variant ViT models, e.g., DeiT [220], PVT [221], TNT [222], and Swin [223], have been proposed for the pursuit of stronger performance. There are also plenty of works trying to augment a pure transformer block or self-attention layer with a convolution operation, e.g., BoTNet [224], CeiT [225], CoAtNet [226], CvT [227]. Some works (such as the DETR methods [228][229][230]) try combining CNN-like architectures with transformers for object detection.…”

Section: Vision Transformer (mentioning)

confidence: 99%

Review of Image Classification Algorithms Based on Convolutional Neural Networks

et al. 2021


Image classification has long been an active research direction, and the emergence of deep learning has promoted the development of this field. Convolutional neural networks (CNNs) have gradually become the mainstream algorithm for image classification since 2012, and the CNN architecture applied to other visual recognition tasks (such as object detection, object localization, and semantic segmentation) is generally derived from the network architecture in image classification. In the wake of these successes, CNN-based methods have emerged in remote sensing image scene classification and achieved advanced classification accuracy. In this review, which focuses on the application of CNNs to image classification tasks, we cover their development, from their predecessors up to recent state-of-the-art (SOTA) network architectures. Along the way, we analyze (1) the basic structure of artificial neural networks (ANNs) and the basic network layers of CNNs, (2) the classic predecessor network models, (3) the recent SOTA network algorithms, and (4) a comprehensive comparison of the various image classification methods mentioned in this article. Finally, we summarize the main analysis and discussion in this article and introduce some current trends.


“…Alternatively, Contrastive Learning (CL) has gained popularity in the CV community as a variant of SSL for visual representation [5,6,11,14,26]. CL is based on data augmentation of a self and contrastive term, where learning is carried out by maximizing similarities of the representations of the augmented views of the same object and minimizing similarity with respect to the contrastive object.…”

Section: Self-supervised Learning (mentioning)

confidence: 99%

“…The proposed architecture, shown in Fig. 1, mimics a Siamese network [1] that is commonly used in recent contrastive self-supervised models for representation learning [5,6,11,13,23,24,26]. It has two parallel networks, referred to as a student (left hand side) and teacher (right hand side) networks [6,11].…”

Section: Architecture (mentioning)

confidence: 99%

“…Motivated by these findings, this study proposes SelfGNN, a contrastive self-supervised algorithm for graph neural networks with implicit contrastive terms. SelfGNN imitates a Siamese network [1], which has been widely used in recent contrastive self-supervised learning methods [5,6,11,13,14,23,24,26]. While SelfGNN bears some resemblance to [6,11], its unique characteristic is in contrast to virtually all SSL methods for graphs that require explicit negative sampling.…”

Section: Introduction (mentioning)

confidence: 99%


Self-supervised Graph Neural Networks without explicit negative sampling

Kefato, Girdzijauskas 2021

Preprint


Real-world data is mostly unlabeled, or only a few instances are labeled. Manually labeling data is a very expensive and daunting task. This calls for unsupervised learning techniques that are powerful enough to achieve results comparable to semi-supervised and supervised techniques. Contrastive self-supervised learning has emerged as a powerful direction, in some cases outperforming supervised techniques. In this study, we propose SelfGNN, a novel contrastive self-supervised graph neural network (GNN) that does not rely on explicit contrastive terms. We leverage Batch Normalization, which introduces implicit contrastive terms, without sacrificing performance. Furthermore, as data augmentation is key in contrastive learning, we introduce four feature augmentation (FA) techniques for graphs. Though graph topological augmentation (TA) is commonly used, our empirical findings show that FA performs as well as TA. Moreover, FA incurs no computational overhead, unlike TA, which often has O(N^3) time complexity, where N is the number of nodes. Our empirical evaluation on seven publicly available real-world datasets shows that SelfGNN is powerful and leads to performance comparable with SOTA supervised GNNs and always better than SOTA semi-supervised and unsupervised GNNs. The source code is available at https://github.com/zekarias-tilahun/SelfGNN.


“…This article compares a series of Convolutional Neural Networks (CNNs), such as ResNet-18, 34, 50, 101 (He et al., 2016), VGG11, 13, 16, 19 (Simonyan and Zisserman, 2014), DenseNet-121, 169 (Huang et al., 2017), Inception-V3 (Szegedy et al., 2016), Xception (Chollet, 2017), AlexNet (Krizhevsky et al., 2012), GoogleNet (Szegedy et al., 2015), MobileNet-V2 (Sandler et al., 2018), ShuffleNet-V2x0.5 (Ma et al., 2018), Inception-ResNet-V1 (Szegedy et al., 2017), and a series of visual transformers (VTs), such as vision transformer (ViT) (Dosovitskiy et al., 2020), BotNet (Srinivas et al., 2021), DeiT (Touvron et al., 2020), T2T-ViT (Yuan et al., 2021). The purpose is to find deep learning models that are suitable for EM small datasets.…”

Section: Introduction (mentioning)

confidence: 99%

A Comparative Study of Deep Learning Classification Methods on a Small Environmental Microorganism Image Dataset (EMDS-6): From Convolutional Neural Networks to Visual Transformers

Zhao, Rahaman, et al. 2022

Front. Microbiol.


In recent years, deep learning has made notable advances in Environmental Microorganism (EM) image classification. However, image classification on small EM datasets has still not produced good results. Researchers therefore need to spend a lot of time searching for models that classify well and suit their current equipment and working environment. To provide reliable references for researchers, we conduct a series of comparison experiments on 21 deep learning models. The experiments include direct classification, imbalanced training, and hyper-parameter tuning. During the experiments, we find complementarities among the 21 models, which form the basis for feature-fusion experiments. We also find that geometric-deformation data augmentation struggles to improve the performance of the VT series models (ViT, DeiT, BotNet, and T2T-ViT). In terms of model performance, Xception has the best classification performance, the vision transformer (ViT) model consumes the least time for training, and the ShuffleNet-V2 model has the fewest parameters.
