2021
DOI: 10.48550/arxiv.2105.03404
Preprint

ResMLP: Feedforward networks for image classification with data-efficient training

Abstract: We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on Im…
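The two alternating sub-blocks described in the abstract can be pictured with a short sketch. The following is a minimal, illustrative PyTorch rendering of one such residual block, assuming input tokens of shape (batch, num_patches, dim); the dimension names, the hidden size, and the omission of the paper's Affine/normalization layers are simplifications, not the authors' reference code.

```python
# Minimal sketch of one ResMLP-style residual block (illustrative only).
import torch
import torch.nn as nn

class ResMLPBlockSketch(nn.Module):
    def __init__(self, num_patches, dim, hidden_dim):
        super().__init__()
        # (i) cross-patch linear layer: patches interact, shared across channels
        self.cross_patch = nn.Linear(num_patches, num_patches)
        # (ii) per-patch two-layer feed-forward network: channels interact
        self.cross_channel = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):  # x: (batch, num_patches, dim)
        # Transpose so the linear layer mixes the patch dimension, with the
        # same weights for every channel, then transpose back.
        x = x + self.cross_patch(x.transpose(1, 2)).transpose(1, 2)
        # Channels interact independently for each patch.
        x = x + self.cross_channel(x)
        return x
```

Because the patch-mixing linear layer is shared across channels, every channel sees the same learned spatial mixing, which is what the abstract means by patches interacting "independently and identically across channels."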

Cited by 86 publications (226 citation statements)
References 26 publications (45 reference statements)
“…However, this was for a relatively shallow model, and we cannot guarantee that LayerNorm would not hinder ImageNet-scale models to an even larger degree. We note that the authors of ResMLP also saw a relatively small increase in accuracy for replacing LayerNorm with BatchNorm, but for a larger-scale experiment (Touvron et al, 2021a). We conclude that BatchNorm is no more crucial to our architecture than other regularizations or parameter settings (e.g., kernel size).…”
Section: B E Cifar-10 (mentioning)
Confidence: 69%
“…These models look similar to repeated transformer-encoder blocks (Vaswani et al, 2017) with different operations replacing the self-attention and MLP operations. For example, MLP-Mixer (Tolstikhin et al, 2021) replaces them both with MLPs applied across different dimensions (i.e., spatial and channel location mixing); ResMLP (Touvron et al, 2021a) is a data-efficient variation on this theme. CycleMLP (Chen et al, 2021), gMLP, and vision permutator (Hou et al, 2021) replace one or both blocks with various novel operations.…”
Section: R W (mentioning)
Confidence: 99%
“…The authors of [53] proposed the gMLP, which applies a spatial gating unit on visual tokens. ResMLP [86] adopts an Affine transformation as a substitute for Layer Normalization for acceleration.…”
Section: Related Work (mentioning)
Confidence: 99%
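The Affine substitution mentioned in the statement above amounts to a learnable per-channel scale and shift with no activation statistics. A minimal sketch, illustrative rather than the paper's exact code:

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Per-channel scale and shift: Aff(x) = alpha * x + beta."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))   # learnable scale, initialized to 1
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shift, initialized to 0

    def forward(self, x):  # x: (..., dim)
        # Unlike LayerNorm, no mean or variance is computed over the channel
        # dimension, so this is a cheap element-wise multiply-add.
        return self.alpha * x + self.beta
```

Since the operation involves no statistics, it can in principle be folded into the adjacent linear layers at inference time, which is presumably the "acceleration" the citing authors refer to.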
“…Methods such as Linformer [24], Nyströmformer [9] and Performer [8] reduce the quadratic complexity from O(n²) to a linear O(n). More recently, a group of attention-free Multi-Layer Perceptron (MLP) based approaches such as MLP-Mixer [25] and ResMLP [26] have been proposed that strive to obtain performance similar to that of Transformers, while reducing the computational cost by removing the Self-Attention mechanism altogether and employing MLPs in conjunction with transposition in order to preserve a global receptive field [27].…”
Section: Transformer Architectures (mentioning)
Confidence: 99%
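The "MLPs in conjunction with transposition" idea from the last statement can be made concrete in a few lines: transposing the token and channel axes lets an ordinary linear layer mix information across all patches at once, giving a global receptive field without any self-attention. A hedged sketch with illustrative shapes:

```python
# Attention-free token mixing via transposition (shapes are illustrative).
# Transposing (batch, tokens, channels) -> (batch, channels, tokens) lets a
# plain linear layer act along the token axis, so every output token depends
# on every input token.
import torch
import torch.nn as nn

num_tokens, channels = 196, 384          # e.g. 14x14 patches; values assumed
token_mix = nn.Linear(num_tokens, num_tokens)

x = torch.randn(2, num_tokens, channels)           # (batch, tokens, channels)
y = token_mix(x.transpose(1, 2)).transpose(1, 2)   # mix across tokens, per channel
assert y.shape == x.shape
```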