2021
DOI: 10.48550/arxiv.2105.15075
Preprint
Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

Abstract: Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens leads to higher prediction accuracy, but it also results in drastically increased computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16. In this paper, we argue that every image has its own characteristics…
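To make the tokenization trade-off concrete, here is a minimal sketch in plain PyTorch (the name PatchEmbed and the dimensions are illustrative, not taken from the paper): the same image yields far fewer tokens under a coarser patch size, and since self-attention cost grows quadratically with token count, the coarse split is much cheaper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each patch to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=384):
        super().__init__()
        self.num_tokens = (img_size // patch_size) ** 2
        # A strided convolution is the standard way to embed non-overlapping patches.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)      # (B, num_tokens, D)

img = torch.randn(1, 3, 224, 224)
print(PatchEmbed(patch_size=32)(img).shape)   # coarse: torch.Size([1, 49, 384])
print(PatchEmbed(patch_size=16)(img).shape)   # fine:   torch.Size([1, 196, 384])
```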

Cited by 15 publications (26 citation statements) | References 27 publications
“…• For segmentation, the encoder-decoder Transformer models may unify three segmentation sub-tasks into a mask prediction problem by a series of learnable mask embeddings [29], [84], [137]. This box-free approach has achieved the latest SOTA on multiple benchmarks [137].…”
Section: A Summary Of Recent Improvements (mentioning)
confidence: 99%
“…The latter, unstructured token sparsification, is the work most closely related to ours. Wang et al [42] proposed DVT to dynamically determine the number of patches into which an image is divided. Specifically, they leverage a cascade of ViT models, where each ViT is responsible for one token length.…”
Section: Model Compression (mentioning)
confidence: 99%
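A rough sketch of the cascade structure described in this citation, under the assumption that off-the-shelf timm ViTs with different patch sizes stand in for the per-token-length models (the model names below are illustrative, not the authors' released checkpoints):

```python
import timm

# One ViT per tokenization granularity, ordered from cheapest to most expensive.
cascade = [
    timm.create_model("vit_small_patch32_224", pretrained=True),  #  7x7  =  49 tokens
    timm.create_model("vit_small_patch16_224", pretrained=True),  # 14x14 = 196 tokens
]
for vit in cascade:
    vit.eval()  # inference only
```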
“…It will stop inference for an input image once it has sufficient confidence in the prediction at the current token length. Different from DVT [42], our method is more accessible and practical since only a single ViT model is required. Moreover, we pay more attention to how to accurately decide the smallest token length that gives a correct prediction in the transformer for each image.…”
Section: Model Compression (mentioning)
confidence: 99%
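The confidence-based stopping rule mentioned in these citations can be sketched as follows (a simplified illustration of the early-exit idea, not either paper's exact criterion; the threshold value is an assumption). With the cascade from the previous sketch, `label, conf, depth = dynamic_predict(cascade, img)` reports how far down the cascade an image had to travel.

```python
import torch

@torch.no_grad()
def dynamic_predict(cascade, image, threshold=0.9):
    """Try the cheapest tokenization first; fall back to finer ones only when
    the softmax confidence stays below `threshold`."""
    for i, vit in enumerate(cascade):
        probs = vit(image).softmax(dim=-1)       # (1, num_classes)
        conf, label = probs.max(dim=-1)
        if conf.item() >= threshold or i == len(cascade) - 1:
            return label.item(), conf.item(), i  # i = how far the cascade went
```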