2023
DOI: 10.1109/tpami.2022.3152247
A Survey on Vision Transformer

Cited by 926 publications (173 citation statements)
References 110 publications
“…Vision transformers In computer vision, convolutional networks have become by far the dominating model class over the last decade. Vision transformers [33] break with the long tradition of using convolutions and are rapidly gaining traction [56]. We find that the best vision transformer (ViT-L trained on 14M images) even exceeds human OOD accuracy (Figure 1a shows the average across 17 datasets).…”
Section: Models (mentioning)
confidence: 93%
“…Inspired by the major success of transformer architectures in the field of NLP, researchers have recently applied transformer to computer vision (CV) tasks [13]. Chen et al [6] trained a sequence transformer to auto-regressively predict pixels, achieving results comparable to CNNs on image classification tasks.…”
Section: Vision Transformer (mentioning)
confidence: 99%
“…Finally, a full connection layer is connected to complete the ViT image classification task. [50]. Meanwhile, the main highlight of ViT is to show that it does not rely on convolutional neural networks and can also achieve good results in image classification [17].…”
Section: Visual Transformers (mentioning)
confidence: 99%
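The citation statements above describe the ViT pipeline: split the image into fixed-size patches, linearly embed them, prepend a classification token, apply self-attention, and finish with a fully connected classification layer. The following is a minimal, hypothetical numpy sketch of that flow, not the authors' implementation; random weights stand in for learned parameters, and the attention block is reduced to a single head with identity Q/K/V projections for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vit_classify(image, patch=4, dim=8, n_classes=10):
    """Toy single-block, single-head ViT forward pass (illustrative only)."""
    H, W, C = image.shape
    # 1. patchify: (num_patches, patch*patch*C)
    patches = (image.reshape(H // patch, patch, W // patch, patch, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * C))
    # 2. linear patch embedding (random weights stand in for learned ones)
    W_embed = rng.normal(size=(patches.shape[1], dim))
    tokens = patches @ W_embed
    # 3. prepend a classification token
    cls = rng.normal(size=(1, dim))
    x = np.vstack([cls, tokens])
    # 4. one self-attention layer (Q/K/V projections omitted: identity)
    attn = softmax(x @ x.T / np.sqrt(dim))
    x = attn @ x
    # 5. final fully connected layer on the classification token
    W_head = rng.normal(size=(dim, n_classes))
    return x[0] @ W_head  # class logits

logits = vit_classify(rng.normal(size=(8, 8, 3)))
print(logits.shape)  # (10,)
```

An 8x8 RGB image with 4x4 patches yields 4 patch tokens plus the classification token, so attention runs over 5 tokens; real ViT models stack many such blocks and learn all the weights shown here as random placeholders.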