2021
DOI: 10.48550/arxiv.2105.03404
Preprint

ResMLP: Feedforward networks for image classification with data-efficient training

Abstract: We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on Im…
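The two alternating sub-blocks described in the abstract can be pictured with a short sketch. The following is a minimal, illustrative PyTorch rendering of one such residual block, assuming input tokens of shape (batch, num_patches, dim); the dimension names, the hidden size, and the omission of the paper's Affine/normalization layers are simplifications, not the authors' reference code.

```python
# Minimal sketch of one ResMLP-style residual block (illustrative only).
import torch
import torch.nn as nn

class ResMLPBlockSketch(nn.Module):
    def __init__(self, num_patches, dim, hidden_dim):
        super().__init__()
        # (i) cross-patch linear layer: patches interact, shared across channels
        self.cross_patch = nn.Linear(num_patches, num_patches)
        # (ii) per-patch two-layer feed-forward network: channels interact
        self.cross_channel = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):  # x: (batch, num_patches, dim)
        # Transpose so the linear layer mixes the patch dimension, with the
        # same weights for every channel, then transpose back.
        x = x + self.cross_patch(x.transpose(1, 2)).transpose(1, 2)
        # Channels interact independently for each patch.
        x = x + self.cross_channel(x)
        return x
```

Because the patch-mixing linear layer is shared across channels, every channel sees the same learned spatial mixing, which is what the abstract means by patches interacting "independently and identically across channels."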

Cited by 86 publications (226 citation statements)
References 26 publications (45 reference statements)
“…However, this was for a relatively shallow model, and we cannot guarantee that LayerNorm would not hinder ImageNet-scale models to an even larger degree. We note that the authors of ResMLP also saw a relatively small increase in accuracy for replacing LayerNorm with BatchNorm, but for a larger-scale experiment (Touvron et al, 2021a). We conclude that BatchNorm is no more crucial to our architecture than other regularizations or parameter settings (e.g., kernel size).…”
Section: B E Cifar-10 (mentioning)
Confidence: 69%
“…These models look similar to repeated transformer-encoder blocks (Vaswani et al, 2017) with different operations replacing the self-attention and MLP operations. For example, MLP-Mixer (Tolstikhin et al, 2021) replaces them both with MLPs applied across different dimensions (i.e., spatial and channel location mixing); ResMLP (Touvron et al, 2021a) is a data-efficient variation on this theme. CycleMLP (Chen et al, 2021), gMLP, and vision permutator (Hou et al, 2021) replace one or both blocks with various novel operations.…”
Section: R W (mentioning)
Confidence: 99%
“…The authors of [53] proposed the gMLP, which applies a spatial gating unit on visual tokens. ResMLP [86] adopts an Affine transformation as a substitute for Layer Normalization for acceleration.…”
Section: Related Work (mentioning)
Confidence: 99%
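The Affine substitution mentioned in the statement above amounts to a learnable per-channel scale and shift with no activation statistics. A minimal sketch, illustrative rather than the paper's exact code:

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Per-channel scale and shift: Aff(x) = alpha * x + beta."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))   # learnable scale, initialized to 1
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shift, initialized to 0

    def forward(self, x):  # x: (..., dim)
        # Unlike LayerNorm, no mean or variance is computed over the channel
        # dimension, so this is a cheap element-wise multiply-add.
        return self.alpha * x + self.beta
```

Since the operation involves no statistics, it can in principle be folded into the adjacent linear layers at inference time, which is presumably the "acceleration" the citing authors refer to.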
“…Methods such as Linformer [24], Nyströmformer [9] and Performer [8] reduce the quadratic complexity from O(n²) to a linear O(n). More recently, a group of attention-free Multi-Layer Perceptron (MLP) based approaches such as MLP-Mixer [25] and ResMLP [26] have been proposed that strive to obtain performance similar to that of Transformers, while reducing the computational cost by removing the Self-Attention mechanism altogether and employing MLPs in conjunction with transposition in order to preserve a global receptive field [27].…”
Section: Transformer Architectures (mentioning)
Confidence: 99%
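The "MLPs in conjunction with transposition" idea from the last statement can be made concrete in a few lines: transposing the token and channel axes lets an ordinary linear layer mix information across all patches at once, giving a global receptive field without any self-attention. A hedged sketch with illustrative shapes:

```python
# Attention-free token mixing via transposition (shapes are illustrative).
# Transposing (batch, tokens, channels) -> (batch, channels, tokens) lets a
# plain linear layer act along the token axis, so every output token depends
# on every input token.
import torch
import torch.nn as nn

num_tokens, channels = 196, 384          # e.g. 14x14 patches; values assumed
token_mix = nn.Linear(num_tokens, num_tokens)

x = torch.randn(2, num_tokens, channels)           # (batch, tokens, channels)
y = token_mix(x.transpose(1, 2)).transpose(1, 2)   # mix across tokens, per channel
assert y.shape == x.shape
```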