2022
DOI: 10.1609/aaai.v36i3.20150
Scaled ReLU Matters for Training Vision Transformers

Abstract: Vision transformers (ViTs) have emerged as an alternative design paradigm to convolutional neural networks (CNNs). However, ViTs are much harder to train than CNNs, as they are sensitive to training parameters such as the learning rate, optimizer, and number of warmup epochs. The reasons for this training difficulty are empirically analysed in the paper Early Convolutions Help Transformers See Better, whose authors conjecture that the issue lies with the patchify stem of ViT models. In this paper, we further investigate this p…

Cited by 23 publications (5 citation statements)
References 57 publications
“…Since then it has been widely used in natural language processing (NLP), e.g., BERT [42]. Owing to the Transformer's success in NLP, it has attracted growing attention in computer vision in recent years, e.g., Vision Transformer (ViT) [43] and Swin Transformer [44]. ViT divides the input image into non-overlapping image patches and linearly projects each patch into a d-dimensional feature vector using a learnable weight matrix [45]. Inspired by ViT, the spectrum is divided into several patches of the same sequence length as the input to the Transformer, which reduces the length of the input sequence and enables straightforward processing and analysis at lower computational complexity.…”
Section: A Transformer-Based Encoder For HTD
confidence: 99%
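The patchify-and-project step the statement above describes can be sketched in a few lines of numpy. This is a minimal illustration, not the cited implementation; `patchify_embed`, the image size, patch size, and embedding dimension are all hypothetical choices.

```python
import numpy as np

def patchify_embed(image, patch_size, weight):
    """Split an (H, W, C) image into non-overlapping patch_size x patch_size
    patches, flatten each, and linearly project it to a d-dim token."""
    H, W, C = image.shape
    p = patch_size
    # (H//p, p, W//p, p, C) -> (num_patches, p*p*C)
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches @ weight  # (num_patches, d)

# Hypothetical sizes: a 32x32 RGB image, 16x16 patches, d = 64
img = np.random.rand(32, 32, 3)
W_e = np.random.rand(16 * 16 * 3, 64)  # learnable projection matrix
tokens = patchify_embed(img, 16, W_e)
print(tokens.shape)  # (4, 64): four patches, each a 64-dim token
```

With a 224x224 image and 16x16 patches this yields the familiar 14 x 14 = 196 tokens of the standard ViT.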
“…As is known, the conventional Transformer was initially designed to handle sequential data in NLP, so how the image is mapped to a patch sequence is vital for a vision transformer. ViT [24] directly splits the input image into 16 × 16 non-overlapping patches, while more recent works [40] find that convolution in the patch embedding contributes significantly to mapping the image to a higher-quality token sequence. Following existing works [21,26] that adopt overlapped patch embedding, we first use a 7 × 7 convolution layer with a stride of 2 as the first layer of the patch embedding, followed by an extra 3 × 3 convolution layer with a stride of 1.…”
Section: Patch Embedding
confidence: 99%
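The overlapped convolutional patch embedding described above (7 × 7 stride-2 conv followed by a 3 × 3 stride-1 conv) can be sketched with a naive numpy convolution to make the output shapes concrete. This is an illustrative sketch under assumed sizes (56 × 56 input, 32 channels), not the cited model's code.

```python
import numpy as np

def conv2d(x, w, stride, pad):
    """Naive 2D convolution. x: (H, W, Cin); w: (k, k, Cin, Cout)."""
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    k = w.shape[0]
    H_out = (x.shape[0] - k) // stride + 1
    W_out = (x.shape[1] - k) // stride + 1
    out = np.empty((H_out, W_out, w.shape[3]))
    for i in range(H_out):
        for j in range(W_out):
            win = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            out[i, j] = np.tensordot(win, w, axes=3)  # sum over k, k, Cin
    return out

x = np.random.rand(56, 56, 3)           # hypothetical input size
w1 = np.random.rand(7, 7, 3, 32)        # 7x7 conv, stride 2, pad 3
w2 = np.random.rand(3, 3, 32, 32)       # 3x3 conv, stride 1, pad 1
y = conv2d(conv2d(x, w1, stride=2, pad=3), w2, stride=1, pad=1)
print(y.shape)  # (28, 28, 32): spatial size halved by the stride-2 conv
```

Because the 7 × 7 kernel slides with stride 2, adjacent patches overlap, unlike ViT's non-overlapping patchify stem.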
“…We adopt the idea of [62] to parameterize the ReLU function, which is extended into the scaled ReLU (sReLU) [64]: sReLU(x) = W_a ReLU(x), where W_a is the scaling matrix. To preserve the gradient stability of the adaptation process, we follow two design choices from [59]: (1) unlike [64], we do not parameterize the negative values; (2) W_a is initialized as an identity matrix and restricted to be diagonal.…”
Section: B Structural Transformation On ReLU
confidence: 99%
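The statement above can be sketched as a diagonal-scaled ReLU, assuming the form sReLU(x) = W_a ReLU(x) with W_a diagonal and initialized to identity (so it starts as plain ReLU). The function and variable names here are hypothetical, not from the cited paper.

```python
import numpy as np

def srelu(x, w_a):
    """Scaled ReLU: standard ReLU followed by a learnable scaling matrix.
    Negative inputs are zeroed (not parameterized, per design choice (1))."""
    return np.maximum(x, 0.0) @ w_a

d = 4
w_a = np.diag(np.ones(d))  # identity init, restricted to diagonal (choice (2))
x = np.array([[-1.0, 0.5, -2.0, 3.0]])
print(srelu(x, w_a))  # at identity init this equals plain ReLU: [[0. 0.5 0. 3.]]
```

Restricting W_a to a diagonal keeps the scaling per-channel, and the identity initialization means the transformed network starts from the behaviour of the original ReLU network.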