2021
DOI: 10.48550/arxiv.2106.05786
Preprint

CAT: Cross Attention in Vision Transformer

Abstract: Since the Transformer has found widespread use in NLP, its potential in CV has been realized and has inspired many new approaches. However, the computation required when word tokens are replaced with image patches after tokenizing the image is vast (e.g., ViT), which bottlenecks model training and inference. In this paper, we propose a new attention mechanism in Transformer, termed Cross Attention, which alternates attention within each image patch instead of over the whole image to capt…
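The core idea the abstract describes, computing self-attention only inside each image patch rather than across the full token grid, can be sketched as follows. This is a minimal illustration with assumed sizes and module names, not the paper's released implementation.

```python
# A minimal sketch (not the authors' code) of attention restricted to
# non-overlapping local patches, the idea the abstract describes for
# cutting the quadratic cost of global attention. Sizes are illustrative.
import torch
import torch.nn as nn

class LocalPatchAttention(nn.Module):
    def __init__(self, dim, patch_size, num_heads=4):
        super().__init__()
        self.patch_size = patch_size
        # batch_first=True makes the attention expect (B, L, C) inputs
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) feature map of embedded tokens
        B, H, W, C = x.shape
        p = self.patch_size
        # Group tokens into non-overlapping p x p patches:
        # (B, H/p, p, W/p, p, C) -> (B * num_patches, p*p, C)
        x = x.reshape(B, H // p, p, W // p, p, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, p * p, C)
        # Self-attention only among the p*p tokens of each patch, so the
        # cost grows linearly with image size instead of quadratically.
        x, _ = self.attn(x, x, x)
        # Restore the (B, H, W, C) layout
        x = x.reshape(B, H // p, W // p, p, p, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# Example: a 56x56 map of 96-dim tokens, attention inside 7x7 patches
tokens = torch.randn(2, 56, 56, 96)
out = LocalPatchAttention(dim=96, patch_size=7)(tokens)
print(out.shape)  # torch.Size([2, 56, 56, 96])
```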

Cited by 6 publications (11 citation statements) | References 66 publications
“…Due to the time limit, we did not finish the experiments testing the performance of the model with more encoder and decoder layers connected by a skip connection going through a ViT block, or testing ViT with different embedding dimensions. Also, inspired by recent work on innovative Transformer models for image classification or segmentation that reduce computational complexity through special algorithms [19,9,10], we would also try to introduce them into our model in the future to improve performance and decrease computational complexity.…”
Section: Discussion
confidence: 99%
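The layout this statement alludes to, an encoder skip connection routed through a ViT-style block before reaching the decoder, might look roughly like the following. It is a minimal sketch assuming a U-Net-style encoder/decoder; the class and parameter names are hypothetical and not taken from the cited work.

```python
# Hypothetical sketch: refine a U-Net skip connection with a small
# Transformer encoder before concatenating it into the decoder path.
import torch
import torch.nn as nn

class ViTSkipBlock(nn.Module):
    def __init__(self, channels, num_heads=4, depth=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, feat):
        # feat: (B, C, H, W) encoder feature map
        B, C, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = self.encoder(tokens)              # attention over all positions
        return tokens.transpose(1, 2).reshape(B, C, H, W)

# The decoder would concatenate the refined skip with its upsampled input:
skip = torch.randn(1, 64, 32, 32)
refined = ViTSkipBlock(channels=64)(skip)
decoder_in = torch.cat([refined, torch.randn(1, 64, 32, 32)], dim=1)
print(decoder_in.shape)  # torch.Size([1, 128, 32, 32])
```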
“…Swin Transformer [34] designs shifted window-based multi-head attention to reduce the computation cost. CAT [70] alternately applies attention within patches and between patches to maintain performance at a lower computational cost and builds a cross-attention hierarchical network. Due to the strong performance of Swin Transformer, it is used as the backbone network.…”
Section: Transformer
confidence: 99%
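The alternation this statement describes, a local attention stage inside each patch followed by a global stage across patches, can be sketched as below. This is an illustrative simplification that summarizes each patch by mean pooling for the cross-patch step; it is not necessarily the paper's exact inner-patch/cross-patch formulation.

```python
# Sketch of one block that alternates attention within patches (local)
# and attention across patch-level tokens (global). Names are illustrative.
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    def __init__(self, dim, patch_size, num_heads=4):
        super().__init__()
        self.p = patch_size
        self.inner = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C)
        B, H, W, C = x.shape
        p = self.p
        nh, nw = H // p, W // p

        # 1) Inner-patch attention: tokens attend only within their patch.
        x = x.reshape(B, nh, p, nw, p, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * nh * nw, p * p, C)
        x = self.inner(x, x, x)[0] + x

        # 2) Cross-patch attention: each patch is summarized by mean pooling,
        #    the patch tokens attend to one another, and the update is
        #    broadcast back to every token of that patch.
        patches = x.reshape(B, nh * nw, p * p, C)
        summary = patches.mean(dim=2)                       # (B, nh*nw, C)
        summary = self.cross(summary, summary, summary)[0]  # global mixing
        x = (patches + summary.unsqueeze(2)).reshape(B, nh, nw, p, p, C)

        # Restore (B, H, W, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

out = AlternatingAttentionBlock(dim=96, patch_size=7)(torch.randn(2, 56, 56, 96))
print(out.shape)  # torch.Size([2, 56, 56, 96])
```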
“…To enhance local feature extraction while retaining a convolution-free structure, many works [27][28][29] adapt to the patch structure through local self-attention mechanisms. For example, Swin Transformer limits attention to a single window, which introduces the locality of the convolution operation and reduces the amount of computation.…”
Section: Transformer with Local Attention Enhancement
confidence: 99%
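A quick back-of-the-envelope calculation shows why limiting attention to a window "saves the amount of calculation": global self-attention over N tokens scales with N², while window attention over windows of M tokens scales with N·M. The figures below assume a 56×56 token map and 7×7 windows purely for illustration.

```python
# Compare query-key interaction counts for global vs. window attention.
def attention_pairs(num_tokens, window_tokens=None):
    """Number of query-key interactions in one attention layer."""
    if window_tokens is None:              # global: every token attends to all tokens
        return num_tokens * num_tokens
    num_windows = num_tokens // window_tokens
    return num_windows * window_tokens * window_tokens   # = num_tokens * window_tokens

N = 56 * 56   # tokens in the feature map
M = 7 * 7     # tokens per window
print(attention_pairs(N))      # 9834496 pairs for global attention
print(attention_pairs(N, M))   # 153664 pairs for window attention (N/M = 64x fewer)
```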