On the Relationship between Self-Attention and Convolutional Layers
Preprint, 2019
DOI: 10.48550/arxiv.1911.03584

Abstract: Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, that they often learn to do so in practice.
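The claim that attention layers can perform convolution can be made concrete with a small numerical sketch. Under the paper's construction (as I understand it), a multi-head self-attention layer with K*K heads can express a K x K convolution when each head attends, with probability one, to the pixel at one fixed relative shift and its value/output projection carries the corresponding filter slice. The NumPy sketch below only illustrates that idea; the function names, shapes, and the zero-padding choice are assumptions, not the authors' code.

import numpy as np

# Hedged illustration: one "head" per relative shift of a K x K kernel, each
# using a hard (one-hot) attention distribution over key pixels. All names and
# shapes here are illustrative assumptions, not the authors' implementation.

def conv2d_same(X, W):
    """Plain K x K 'same' convolution with zero padding.
    X: (H, W, D_in), W: (K, K, D_in, D_out)."""
    K = W.shape[0]
    pad = K // 2
    H, Wd, _ = X.shape
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, Wd, W.shape[3]))
    for i in range(K):
        for j in range(K):
            out += Xp[i:i + H, j:j + Wd] @ W[i, j]
    return out

def conv2d_as_attention(X, W):
    """The same map written as K*K attention heads. Head (i, j) builds a
    row-stochastic attention matrix that is one-hot on the key pixel at
    relative shift (i - K//2, j - K//2); its value/output projection is the
    filter slice W[i, j]. Summing the heads reproduces the convolution."""
    K = W.shape[0]
    pad = K // 2
    H, Wd, D_in = X.shape
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    Hp, Wp = H + 2 * pad, Wd + 2 * pad
    keys = Xp.reshape(Hp * Wp, D_in)              # flattened key/value pixels
    out = np.zeros((H, Wd, W.shape[3]))
    for i in range(K):
        for j in range(K):
            A = np.zeros((H * Wd, Hp * Wp))       # attention matrix of this head
            for q in range(H * Wd):
                qi, qj = divmod(q, Wd)
                A[q, (qi + i) * Wp + (qj + j)] = 1.0   # attend to one shifted pixel
            out += (A @ keys).reshape(H, Wd, D_in) @ W[i, j]
    return out

X = np.random.randn(8, 8, 4)                      # toy feature map
W = np.random.randn(3, 3, 4, 5)                   # 3 x 3 convolution filters
assert np.allclose(conv2d_same(X, W), conv2d_as_attention(X, W))

The assertion at the end checks that the two functions produce identical outputs on a random feature map, which is the sense in which the attention layer "performs" the convolution.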

Cited by 70 publications (77 citation statements)
References 6 publications
“…Many efforts have been made to incorporate features of convolutional networks into vision transformers and vice versa. Self-attention can emulate convolution (Cordonnier et al., 2019) and can be initialized or regularized to be like it (d'Ascoli et al., 2021); other works simply add convolution operations to transformers (Dai et al., 2021; Guo et al., 2021), or include downsampling to be more like traditional pyramid-shaped convolutional networks. Conversely, self-attention or attention-like operations can supplement or replace convolution in ResNet-style models (Ramachandran et al., 2019; Bello, 2021).…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
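The remark above that self-attention "can be initialized or regularized to be like" convolution can be illustrated with one common recipe: give each head a quadratic relative-positional score centred on a preferred shift, so the softmax concentrates on that shift. The sketch below is only an illustration of this idea under my own assumptions; the sharpness parameter alpha, the head-to-shift assignment, and the function name are not taken from any of the cited papers' code.

import numpy as np

# Hedged sketch of a convolution-like initialization for one attention head:
# the head scores relative shift (dy, dx) as -alpha * ||(dy, dx) - delta_h||^2,
# so the softmax concentrates on the head's preferred shift delta_h as alpha grows.
def head_attention_over_shifts(delta_h, alpha, radius=3):
    """Softmax attention over relative shifts in a (2*radius+1)^2 window."""
    shifts = [(dy, dx) for dy in range(-radius, radius + 1)
                       for dx in range(-radius, radius + 1)]
    scores = np.array([-alpha * ((dy - delta_h[0]) ** 2 + (dx - delta_h[1]) ** 2)
                       for dy, dx in shifts])
    probs = np.exp(scores - scores.max())
    return dict(zip(shifts, probs / probs.sum()))

# One head per tap of a 3 x 3 kernel: with a large alpha each head attends almost
# exclusively to its own shift, i.e. the layer behaves like a 3 x 3 convolution.
kernel_shifts = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
heads = {d: head_attention_over_shifts(d, alpha=5.0) for d in kernel_shifts}
print(round(heads[(1, -1)][(1, -1)], 3))   # about 0.97; approaches 1 as alpha grows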
“…Developing non-convolutional neural networks to tackle computer vision tasks, particularly Transformer neural networks [44], has been an active area of research. Prior works have looked at local multi-headed self-attention, drawing from the structure of convolutional receptive fields [30,36], directly combining CNNs with self-attention [4,2,46], or applying Transformers to smaller-size images [6,9]. In comparison to these, the Vision Transformer [14] makes even fewer modifications to the Transformer architecture, making it especially interesting to compare to CNNs.…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
“…These Vision Transformers (ViT) operate almost identically to Transformers used in language [13], using self-attention, rather than convolution, to aggregate information across locations. This is in contrast with a large body of prior work, which has focused on more explicitly incorporating image-specific inductive biases [30,9,4]. This breakthrough highlights a fundamental question: how are Vision Transformers solving these image-based tasks? Do they act like convolutions, learning the same inductive biases from scratch?…”
Section: Introduction (citation type: mentioning)
Confidence: 96%
“…Research has revealed that, with certain techniques regularizing the head subspace, multi-head attention can learn the desired diverse representations [12,16,18]. Considering that spatial information becomes abstract after downsampling, we intend to strengthen the spatial representational power of multi-head attention.…”
Section: Large Window Attention (citation type: mentioning)
Confidence: 99%
“…3, branches of large window attention provide three hierarchies of receptive fields for the local window. Following the previous literature on the window attention mechanism [30], we set the patch size of the local window to 8; thus the provided receptive fields are (16, 32, 64). The image pooling branch uses a global pooling layer to obtain global contextual information and passes it through a linear transformation followed by a bilinear upsampling operation to match the feature dimension.…”
Section: LawinASPP (citation type: mentioning)
Confidence: 99%
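As a reading aid for the image-pooling branch described in the last excerpt (global pooling, a linear transformation, then bilinear upsampling to restore the spatial resolution), here is a minimal PyTorch sketch. The class name, channel sizes, and layer choices are assumptions for illustration only and are not taken from the cited paper's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImagePoolingBranch(nn.Module):
    """Global average pool -> linear projection -> bilinear upsample back to (H, W)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)               # global pooling
        self.proj = nn.Linear(in_channels, out_channels)  # linear transformation

    def forward(self, x):                                 # x: (B, C, H, W)
        b, _, h, w = x.shape
        g = self.pool(x).flatten(1)                       # (B, C) global context vector
        g = self.proj(g).view(b, -1, 1, 1)                # (B, C_out, 1, 1)
        # bilinear upsampling so the branch matches the feature-map resolution
        return F.interpolate(g, size=(h, w), mode="bilinear", align_corners=False)

feat = torch.randn(2, 64, 16, 16)
print(ImagePoolingBranch(64, 64)(feat).shape)             # torch.Size([2, 64, 16, 16])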