On the Relationship between Self-Attention and Convolutional Layers
Preprint, 2019
DOI: 10.48550/arxiv.1911.03584

Abstract: Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, that they often learn to do so in practice.
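The claim that attention layers can perform convolution can be made concrete with a small numerical sketch. Under the paper's construction (as I understand it), a multi-head self-attention layer with K*K heads can express a K x K convolution when each head attends, with probability one, to the pixel at one fixed relative shift and its value/output projection carries the corresponding filter slice. The NumPy sketch below only illustrates that idea; the function names, shapes, and the zero-padding choice are assumptions, not the authors' code.

import numpy as np

# Hedged illustration: one "head" per relative shift of a K x K kernel, each
# using a hard (one-hot) attention distribution over key pixels. All names and
# shapes here are illustrative assumptions, not the authors' implementation.

def conv2d_same(X, W):
    """Plain K x K 'same' convolution with zero padding.
    X: (H, W, D_in), W: (K, K, D_in, D_out)."""
    K = W.shape[0]
    pad = K // 2
    H, Wd, _ = X.shape
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, Wd, W.shape[3]))
    for i in range(K):
        for j in range(K):
            out += Xp[i:i + H, j:j + Wd] @ W[i, j]
    return out

def conv2d_as_attention(X, W):
    """The same map written as K*K attention heads. Head (i, j) builds a
    row-stochastic attention matrix that is one-hot on the key pixel at
    relative shift (i - K//2, j - K//2); its value/output projection is the
    filter slice W[i, j]. Summing the heads reproduces the convolution."""
    K = W.shape[0]
    pad = K // 2
    H, Wd, D_in = X.shape
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    Hp, Wp = H + 2 * pad, Wd + 2 * pad
    keys = Xp.reshape(Hp * Wp, D_in)              # flattened key/value pixels
    out = np.zeros((H, Wd, W.shape[3]))
    for i in range(K):
        for j in range(K):
            A = np.zeros((H * Wd, Hp * Wp))       # attention matrix of this head
            for q in range(H * Wd):
                qi, qj = divmod(q, Wd)
                A[q, (qi + i) * Wp + (qj + j)] = 1.0   # attend to one shifted pixel
            out += (A @ keys).reshape(H, Wd, D_in) @ W[i, j]
    return out

X = np.random.randn(8, 8, 4)                      # toy feature map
W = np.random.randn(3, 3, 4, 5)                   # 3 x 3 convolution filters
assert np.allclose(conv2d_same(X, W), conv2d_as_attention(X, W))

The assertion at the end checks that the two functions produce identical outputs on a random feature map, which is the sense in which the attention layer "performs" the convolution.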

Cited by 70 publications (77 citation statements)
References 6 publications
“…Many efforts have been made to incorporate features of convolutional networks into vision transformers and vice versa. Self-attention can emulate convolution (Cordonnier et al., 2019) and can be initialized or regularized to be like it (d'Ascoli et al., 2021); other works simply add convolution operations to transformers (Dai et al., 2021; Guo et al., 2021), or include downsampling to be more like traditional pyramid-shaped convolutional networks. Conversely, self-attention or attention-like operations can supplement or replace convolution in ResNet-style models (Ramachandran et al., 2019; Bello, 2021).…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
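The remark above that self-attention "can be initialized or regularized to be like" convolution can be illustrated with one common recipe: give each head a quadratic relative-positional score centred on a preferred shift, so the softmax concentrates on that shift. The sketch below is only an illustration of this idea under my own assumptions; the sharpness parameter alpha, the head-to-shift assignment, and the function name are not taken from any of the cited papers' code.

import numpy as np

# Hedged sketch of a convolution-like initialization for one attention head:
# the head scores relative shift (dy, dx) as -alpha * ||(dy, dx) - delta_h||^2,
# so the softmax concentrates on the head's preferred shift delta_h as alpha grows.
def head_attention_over_shifts(delta_h, alpha, radius=3):
    """Softmax attention over relative shifts in a (2*radius+1)^2 window."""
    shifts = [(dy, dx) for dy in range(-radius, radius + 1)
                       for dx in range(-radius, radius + 1)]
    scores = np.array([-alpha * ((dy - delta_h[0]) ** 2 + (dx - delta_h[1]) ** 2)
                       for dy, dx in shifts])
    probs = np.exp(scores - scores.max())
    return dict(zip(shifts, probs / probs.sum()))

# One head per tap of a 3 x 3 kernel: with a large alpha each head attends almost
# exclusively to its own shift, i.e. the layer behaves like a 3 x 3 convolution.
kernel_shifts = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
heads = {d: head_attention_over_shifts(d, alpha=5.0) for d in kernel_shifts}
print(round(heads[(1, -1)][(1, -1)], 3))   # about 0.97; approaches 1 as alpha grows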
“…Developing non-convolutional neural networks to tackle computer vision tasks, particularly Transformer neural networks [44], has been an active area of research. Prior works have looked at local multi-headed self-attention, drawing from the structure of convolutional receptive fields [30,36], directly combining CNNs with self-attention [4,2,46], or applying Transformers to smaller-size images [6,9]. In comparison to these, the Vision Transformer [14] makes even fewer modifications to the Transformer architecture, making it especially interesting to compare to CNNs.…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
“…These Vision Transformers (ViT) operate almost identically to Transformers used in language [13], using self-attention, rather than convolution, to aggregate information across locations. This is in contrast with a large body of prior work, which has focused on more explicitly incorporating image-specific inductive biases [30,9,4]. This breakthrough highlights a fundamental question: how are Vision Transformers solving these image-based tasks? Do they act like convolutions, learning the same inductive biases from scratch?…”
Section: Introduction (citation type: mentioning)
Confidence: 96%
“…Research has revealed that, with certain techniques regularizing the head subspace, multi-head attention can learn the desired diverse representations [12,16,18]. Considering that spatial information becomes abstract after downsampling, we intend to strengthen the spatial representational power of multi-head attention.…”
Section: Large Window Attention (citation type: mentioning)
Confidence: 99%
“…3, branches of large window attention provide three hierarchies of receptive fields for the local window. Following the previous literature on the window attention mechanism [30], we set the patch size of the local window to 8; thus the provided receptive fields are (16, 32, 64). The image pooling branch uses a global pooling layer to obtain global contextual information and passes it through a linear transformation followed by a bilinear upsampling operation to match the feature dimension.…”
Section: LawinASPP (citation type: mentioning)
Confidence: 99%
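As a reading aid for the image-pooling branch described in the last excerpt (global pooling, a linear transformation, then bilinear upsampling to restore the spatial resolution), here is a minimal PyTorch sketch. The class name, channel sizes, and layer choices are assumptions for illustration only and are not taken from the cited paper's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ImagePoolingBranch(nn.Module):
    """Global average pool -> linear projection -> bilinear upsample back to (H, W)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)               # global pooling
        self.proj = nn.Linear(in_channels, out_channels)  # linear transformation

    def forward(self, x):                                 # x: (B, C, H, W)
        b, _, h, w = x.shape
        g = self.pool(x).flatten(1)                       # (B, C) global context vector
        g = self.proj(g).view(b, -1, 1, 1)                # (B, C_out, 1, 1)
        # bilinear upsampling so the branch matches the feature-map resolution
        return F.interpolate(g, size=(h, w), mode="bilinear", align_corners=False)

feat = torch.randn(2, 64, 16, 16)
print(ImagePoolingBranch(64, 64)(feat).shape)             # torch.Size([2, 64, 16, 16])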