2023
DOI: 10.1109/tpami.2022.3206148

ResMLP: Feedforward Networks for Image Classification With Data-Efficient Training

Abstract: We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet…
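A minimal sketch of one ResMLP residual block, following the abstract's description of (i) a cross-patch linear layer and (ii) a per-patch two-layer feed-forward network. It is written in Python with PyTorch; the Affine normalization, layer names, and expansion factor are illustrative assumptions, not the authors' reference code.

# Python/PyTorch sketch of a ResMLP block (illustrative, not the official implementation).
import torch
import torch.nn as nn

class Affine(nn.Module):
    # Per-channel affine transform used here in place of LayerNorm (assumption).
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.alpha * x + self.beta

class ResMLPBlock(nn.Module):
    def __init__(self, num_patches, dim, expansion=4):
        super().__init__()
        self.norm1 = Affine(dim)
        # (i) linear layer in which patches interact, identically across channels
        self.patch_mix = nn.Linear(num_patches, num_patches)
        self.norm2 = Affine(dim)
        # (ii) two-layer feed-forward network in which channels interact per patch
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, expansion * dim),
            nn.GELU(),
            nn.Linear(expansion * dim, dim),
        )

    def forward(self, x):  # x: (batch, num_patches, dim)
        x = x + self.patch_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x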

Cited by 520 publications (483 citation statements)
References: 49 publications
Citation statements (ordered by relevance):
“…Despite the simple architecture design, ConvFormer attains consistent improvements over other state-of-the-art models. For example, ConvFormer-S outperforms ResMLP-24 (Touvron et al, 2021a) by 3.4% top-1 accuracy, while requiring fewer parameters (30.0 M → 26.7 M) and less computation (6.0 G → 5.0 G FLOPs). Compared with recent, well-established ViTs such as Swin-S (Liu et al, 2021b) and Focal-S (Yang et al, 2021), ConvFormer also shows better performance.…”
Section: Results
Citation type: mentioning
Confidence: 99%
“…
Model | Params (M) | GFLOPs | Top-1 Acc (%)
T2T-ViT_t-14 (Yuan et al, 2021) | 21.5 | 6.1 | 81.7
PVT-Small (Wang et al, 2021c) | 24.5 | 3.8 | 79.8
TNT-S (Han et al, 2021a) | 23.8 | 5.2 | 81.5
gMLP-S (Liu et al, 2021a) | 20.0 | 4.5 | 79.6
Swin-T (Liu et al, 2021c) | 28.3 | 4.5 | 81.3
PoolFormer-S24 (Yu et al, 2021) | 21.4 | 3.6 | 80.3
ResMLP-24 (Touvron et al, 2021a) | 30.0 | 6.0 | 79.4
Twins-SVT-S (Chu et al, 2021) | 24.0 | 2.8 | 81.7
GFNet-S (Rao et al, 2021) | 25.0 | 4.5 | 80.0
PVTv2-B2 (Wang et al, 2021a) | 25.4 | 4.0 | 82.0
Focal-T (Yang et al, 2021) | 29.1 | 4.9 | 82.2
ConvNeXt-T (Liu et al, 2022) | 28 | … | …
… the AdamW (Loshchilov and Hutter, 2018) optimizer, a total batch size of 16 on 8 GPUs. The initial learning rate is set to 1 × 10⁻⁴.…”
Section: Methods
Citation type: mentioning
Confidence: 99%
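The quoted Methods excerpt specifies an AdamW optimizer, a total batch size of 16 across 8 GPUs, and an initial learning rate of 1 × 10⁻⁴. Below is a hedged Python/PyTorch sketch of that optimizer configuration; the model, dataset, and weight decay are placeholders, not values from the cited paper.

# Sketch of the training configuration mentioned in the excerpt (assumptions marked).
import torch
from torch.utils.data import DataLoader, TensorDataset

world_size = 8                                        # 8 GPUs (from the excerpt)
total_batch_size = 16                                 # total batch size (from the excerpt)
per_gpu_batch_size = total_batch_size // world_size   # = 2 samples per GPU

model = torch.nn.Linear(3 * 224 * 224, 1000)          # placeholder model
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-4,                # initial learning rate (from the excerpt)
                              weight_decay=0.05)      # assumption, not stated in the excerpt

# Placeholder single-process loader; a distributed run would give each of the
# 8 ranks a loader with batch_size=2 so that the global batch size is 16.
dataset = TensorDataset(torch.randn(64, 3 * 224 * 224), torch.randint(0, 1000, (64,)))
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, shuffle=True)

for images, labels in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()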
“…Moreover, feedforward NNs are also part of CNN architectures: they are placed after the initial convolutional layers for the final prediction. Researchers have recently been revisiting simpler feedforward NNs (also called multi-layer perceptrons) for classification tasks (Touvron et al, 2021). We therefore selected a basic feedforward NN model to study vocal tract dynamics.…”
Section: Neural Network-based Classification Model
Citation type: mentioning
Confidence: 99%
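The passage above refers to a basic feedforward NN (multi-layer perceptron) classifier. The following Python/PyTorch sketch shows what such a model might look like; the input dimension, hidden sizes, and number of speech-token classes are assumptions for illustration, not values from the cited study.

# Illustrative feedforward (MLP) classifier; all sizes are assumed.
import torch.nn as nn

def make_mlp_classifier(in_dim=39, hidden=(256, 128), num_classes=10):
    layers, prev = [], in_dim
    for width in hidden:
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, num_classes))   # output logits, one per class
    return nn.Sequential(*layers)

model = make_mlp_classifier()                     # e.g. for 39-dimensional acoustic features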
“…Several studies have shown that NNs with many hidden layers outperform GMM-based models by a large margin on various speech recognition benchmarks (Hinton et al, 2012; Mohamed et al, 2012; Pan et al, 2012). Among the many variants of NN architectures, feedforward NNs are simple yet effective in pattern recognition tasks (Touvron et al, 2021). Moreover, other architectures, mostly derived from feedforward NNs, might relegate vocal tract-induced variabilities, but the effect of these variabilities on speech token classification was the main focus of this article.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
“…where Norm(·) denotes a normalization such as Layer Normalization [1], and TokenMixer(·) denotes a module whose main role is to communicate information among tokens. In recent works, vision transformer models [31,44,45] and the spatial MLPs in MLP-like models [46,47] implement various kinds of token-mixing mechanisms, which mainly propagate information among tokens and, as in attention, also perform some channel mixing. The second sub-block consists of a two-layer MLP with a nonlinear activation function,…”
Section: Forgery-detection-with-facial-detail Transformer
Citation type: mentioning
Confidence: 99%
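The quoted passage describes a generic block built from two residual sub-blocks: token mixing applied after normalization, then a two-layer MLP applied after normalization. A minimal Python/PyTorch sketch is given below; the choice of multi-head self-attention as the TokenMixer, and all sizes, are illustrative assumptions.

# Sketch of the block structure x = x + TokenMixer(Norm(x)); x = x + MLP(Norm(x)).
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # TokenMixer: here multi-head self-attention (one possible choice)
        self.token_mixer = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Second sub-block: two-layer MLP with a nonlinear activation
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                          # x: (batch, tokens, dim)
        y = self.norm1(x)
        x = x + self.token_mixer(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x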