2021
DOI: 10.48550/arxiv.2108.08810
Preprint
Do Vision Transformers See Like Convolutional Neural Networks?

Abstract: Convolutional neural networks (CNNs) have so far been the de-facto model for visual data. Recent work has shown that (Vision) Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks. This raises a central question: how are Vision Transformers solving these tasks? Are they acting like convolutional networks, or learning entirely different visual representations? Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, …
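The analysis the abstract alludes to compares internal representations across layers of ViTs and CNNs; comparisons of this kind are typically measured with linear centered kernel alignment (CKA). The snippet below is a minimal sketch under that assumption; the function name, activation arrays, and shapes are illustrative placeholders, not the paper's code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two sets of layer activations X and Y,
    each of shape (n_examples, n_features). Higher means the two
    layers encode more similar representations of the same inputs."""
    # Center each feature dimension across examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Similarity of the example-by-example Gram structure.
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro")
                   * np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator

# Hypothetical usage: compare a ViT block's token features with a CNN
# stage's flattened feature map over the same batch of images.
vit_block_act = np.random.randn(512, 768)   # placeholder activations
cnn_stage_act = np.random.randn(512, 1024)  # placeholder activations
print(linear_cka(vit_block_act, cnn_stage_act))
```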

Cited by 42 publications (52 citation statements)
References 38 publications (58 reference statements)
“…Ablation Study The class attention A can be obtained from any Transformer block in ViTs. Due to the global receptive field, the class attention does not differ much across blocks [36,14]. We first study the effect of the attention matrix generated at different depths d for DeiT-S. Then we follow [1,5] to compute the attention rollout, which aggregates the attention matrices from all blocks by matrix multiplication.…”
Section: B Additional Results
confidence: 99%
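The excerpt above refers to attention rollout, which aggregates per-block attention matrices by matrix multiplication. Below is a minimal sketch of that aggregation as it is commonly described (average heads, add the identity for the residual connection, re-normalize rows); the tensor shapes and random example inputs are illustrative, not taken from the cited work.

```python
import numpy as np

def attention_rollout(attn_per_block):
    """Aggregate per-block attention maps into one map by matrix
    multiplication. attn_per_block: list of arrays, each of shape
    (heads, tokens, tokens), ordered from the first block onward."""
    tokens = attn_per_block[0].shape[-1]
    rollout = np.eye(tokens)
    for attn in attn_per_block:
        a = attn.mean(axis=0)                 # average over heads
        a = a + np.eye(tokens)                # account for residual connection
        a = a / a.sum(axis=-1, keepdims=True) # keep rows stochastic
        rollout = a @ rollout                 # compose with earlier blocks
    return rollout

# Hypothetical usage: 12 blocks, 6 heads, 197 tokens (CLS + 14x14 patches).
blocks = [np.random.dirichlet(np.ones(197), size=(6, 197)) for _ in range(12)]
cls_to_patches = attention_rollout(blocks)[0, 1:]  # CLS-token attention map
```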
“…We were surprised to find that both convolution- and transformer-based backbone networks attain similar performance. We conjecture that, although it has been widely studied that convolutions and transformers see differently [47], the representations learned by the models end up being much alike because they are pretrained on the same dataset [7]. Note that we only utilized backbones with a pyramidal structure, and the results may differ if other backbone networks are used; we leave this exploration for future work.…”
Section: Ablation Study
confidence: 98%
“…What about transformer-based backbone networks? As addressed in many works [9,47], CNNs and transformers see images differently, which means the choice of backbone network may affect performance significantly, but this has not been explored for this task. We thus employ several well-known vision transformer architectures to explore the potential differences.…”
Section: Ablation Study
confidence: 99%
“…The Vision Transformer (ViT) architecture was first proposed in (Dosovitskiy et al., 2020), which uses the attention mechanism (Vaswani et al., 2017) to solve various vision tasks. Compared to traditional CNN structures that operate on a fixed-size window with restricted spatial interactions (Raghu et al., 2021), ViT allows all positions in an image to interact through transformer blocks. Since then, many variants have been proposed (Graham et al., 2021; Liu et al., 2021c; Yuan et al., 2021a; Wang et al., 2021b; Han et al., 2021; Wu et al., 2021; Chen et al., 2021b; Steiner et al., 2021; El-Nouby et al., 2021; Liu et al., 2021a; Wang et al., 2021a; Bao et al., 2021).…”
Section: Vision Transformers
confidence: 99%
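To make the contrast in the excerpt concrete, here is a minimal, self-contained sketch of a ViT-style block in PyTorch: a convolutional patch embedding followed by global self-attention, in which every patch token can attend to every other token, unlike a convolution's fixed local window. All sizes, names, and the single-block structure are illustrative simplifications, not the architecture of any specific cited model.

```python
import torch
import torch.nn as nn

class MinimalViTBlock(nn.Module):
    """Patch embedding plus one transformer encoder block, showing the
    global token-to-token interaction that distinguishes ViTs from the
    local windows of a convolution. Sizes are illustrative."""
    def __init__(self, img_size=224, patch=16, dim=384, heads=6):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Non-overlapping patches projected to dim-dimensional tokens.
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x):                          # x: (batch, 3, H, W)
        t = self.to_patches(x).flatten(2).transpose(1, 2) + self.pos
        n = self.norm1(t)
        a, _ = self.attn(n, n, n)                  # every token attends to all
        t = t + a                                  # residual connection
        return t + self.mlp(self.norm2(t))         # per-token MLP

out = MinimalViTBlock()(torch.randn(2, 3, 224, 224))  # (2, 196, 384)
```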