2022
DOI: 10.1088/1742-5468/ac9830
ConViT: improving vision transformers with soft convolutional inductive biases*

Abstract: Convolutional architectures have proven to be extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision transformers rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is i…

Cited by 289 publications (131 citation statements)
References 42 publications
“…Models. We compared humans with 84 different DNNs representing the variety of approaches used in the field today: 50 CNNs trained on ImageNet [1, 49, 53–73], 6 CNNs trained on other datasets in addition to ImageNet (which we refer to as "CNN extra data") [1, 66, 74], 10 vision transformers [75–79], 6 CNNs trained with self-supervision [80, 81], and 13 models trained for robustness to noise or adversarial examples [82, 83]. We used pretrained weights for each of these models supplied by their authors, with a variety of licenses (detailed in SI §2), implemented in TensorFlow 2.0, Keras, or PyTorch.…”
Section: Methods
mentioning confidence: 99%
“…For the later layers, the gating parameter eventually converges to nearly 0, indicating that the convolutional inductive bias is effectively ignored. However, for the earlier layers, many attention heads maintain high gating values, suggesting that the network uses the convolutional inductive bias of these layers to aid training [43].…”
Section: Methods
mentioning confidence: 99%
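For reference, the gating parameter discussed in this excerpt is the per-head scalar λ_h of ConViT's gated positional self-attention (GPSA). A sketch of the combination rule it controls is given below; the notation is paraphrased from the ConViT paper and normalization details are omitted.

```latex
% Gated positional self-attention (GPSA), head h, query patch i, key patch j.
% sigma(lambda_h) near 1: the convolution-like positional attention dominates;
% sigma(lambda_h) near 0: standard content-based self-attention dominates.
A^{h}_{ij} \;=\; \bigl(1-\sigma(\lambda_h)\bigr)\,
  \operatorname{softmax}_j\!\Bigl(\tfrac{Q^{h}_{i}\,(K^{h}_{j})^{\top}}{\sqrt{d_h}}\Bigr)
  \;+\; \sigma(\lambda_h)\,
  \operatorname{softmax}_j\!\bigl((v^{h}_{\mathrm{pos}})^{\top} r_{ij}\bigr)
```

Here r_ij encodes the relative position of patches i and j, so the second term acts like a convolutional attention pattern; the observation in the excerpt is that σ(λ_h) decays toward 0 in later layers while staying high in early ones.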
“…Thus, some combinations of transformers and CNNs have been proposed to reduce the computational cost and number of parameters [29]. D’Ascoli et al. introduced a positional self-attention mechanism equipped with a convolutional inductive bias, adjusting the attention to positional and contextual information through learnable gating parameters [30]. BoTNet [31] embedded multi-head self-attention (MHSA) into the three bottleneck blocks of stage C5 of ResNet-50 [32] and stood out in object detection and instance segmentation tasks [33].…”
Section: Related Work
mentioning confidence: 99%
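To make the gating mechanism concrete, below is a minimal, self-contained PyTorch sketch of a gated positional self-attention layer in the spirit of ConViT's GPSA. It is an illustrative approximation rather than the authors' implementation: the class and attribute names (GatedPositionalSelfAttention, pos_scores, gating) are assumed, and the relative-position term is simplified to a learnable per-head attention map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPositionalSelfAttention(nn.Module):
    """Illustrative sketch of GPSA-style gated attention (not the authors' code)."""

    def __init__(self, dim, num_heads, num_patches):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable positional attention scores per head; this stands in for the
        # relative-position term softmax(v_pos^T r_ij) of the GPSA formulation.
        self.pos_scores = nn.Parameter(torch.zeros(num_heads, num_patches, num_patches))
        # Per-head gating parameter lambda_h; sigmoid(lambda_h) weights the
        # convolution-like positional attention against content attention.
        self.gating = nn.Parameter(torch.ones(num_heads))

    def forward(self, x):                       # x: (batch, num_patches, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        content_attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        pos_attn = F.softmax(self.pos_scores, dim=-1)        # (heads, n, n)
        gate = torch.sigmoid(self.gating).view(1, -1, 1, 1)  # (1, heads, 1, 1)

        # Convex combination: gate -> 1 behaves like a convolutional prior,
        # gate -> 0 recovers plain content-based self-attention.
        attn = (1.0 - gate) * content_attn + gate * pos_attn
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```

Reading out torch.sigmoid(layer.gating) for each layer of a trained model would give the per-head gating values discussed in the excerpts above (high in early layers, near zero in later ones).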