2023
DOI: 10.1109/tpami.2022.3206148

ResMLP: Feedforward Networks for Image Classification With Data-Efficient Training

Abstract: We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet…
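A minimal sketch of one ResMLP residual block, following the abstract's description of (i) a cross-patch linear layer and (ii) a per-patch two-layer feed-forward network. It is written in Python with PyTorch; the Affine normalization, layer names, and expansion factor are illustrative assumptions, not the authors' reference code.

# Python/PyTorch sketch of a ResMLP block (illustrative, not the official implementation).
import torch
import torch.nn as nn

class Affine(nn.Module):
    # Per-channel affine transform used here in place of LayerNorm (assumption).
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.alpha * x + self.beta

class ResMLPBlock(nn.Module):
    def __init__(self, num_patches, dim, expansion=4):
        super().__init__()
        self.norm1 = Affine(dim)
        # (i) linear layer in which patches interact, identically across channels
        self.patch_mix = nn.Linear(num_patches, num_patches)
        self.norm2 = Affine(dim)
        # (ii) two-layer feed-forward network in which channels interact per patch
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, expansion * dim),
            nn.GELU(),
            nn.Linear(expansion * dim, dim),
        )

    def forward(self, x):  # x: (batch, num_patches, dim)
        x = x + self.patch_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x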

Cited by 520 publications (483 citation statements)
References: 49 publications
Citation statements (ordered by relevance):
“…Despite the simple architecture design, ConvFormer attains consistent improvements over other state-of-the-art models. For example, ConvFormer-S outperforms ResMLP-24 (Touvron et al, 2021a) by 3.4% top-1 accuracy, while requiring fewer parameters (30.0 M → 26.7 M) and less computation (6.0 G → 5.0 G FLOPs). Compared with recent, well-established ViTs such as Swin-S (Liu et al, 2021b) and Focal-S (Yang et al, 2021), ConvFormer also shows better performance.…”
Section: Results
Citation type: mentioning
Confidence: 99%
“…
Model | Params (M) | GFLOPs | Top-1 Acc (%)
T2T-ViT_t-14 (Yuan et al, 2021) | 21.5 | 6.1 | 81.7
PVT-Small (Wang et al, 2021c) | 24.5 | 3.8 | 79.8
TNT-S (Han et al, 2021a) | 23.8 | 5.2 | 81.5
gMLP-S (Liu et al, 2021a) | 20.0 | 4.5 | 79.6
Swin-T (Liu et al, 2021c) | 28.3 | 4.5 | 81.3
PoolFormer-S24 (Yu et al, 2021) | 21.4 | 3.6 | 80.3
ResMLP-24 (Touvron et al, 2021a) | 30.0 | 6.0 | 79.4
Twins-SVT-S (Chu et al, 2021) | 24.0 | 2.8 | 81.7
GFNet-S (Rao et al, 2021) | 25.0 | 4.5 | 80.0
PVTv2-B2 (Wang et al, 2021a) | 25.4 | 4.0 | 82.0
Focal-T (Yang et al, 2021) | 29.1 | 4.9 | 82.2
ConvNeXt-T (Liu et al, 2022) | 28 | … | …
… the AdamW (Loshchilov and Hutter, 2018) optimizer, a total batch size of 16 on 8 GPUs. The initial learning rate is set to 1 × 10⁻⁴.…”
Section: Methods
Citation type: mentioning
Confidence: 99%
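The quoted Methods excerpt specifies an AdamW optimizer, a total batch size of 16 across 8 GPUs, and an initial learning rate of 1 × 10⁻⁴. Below is a hedged Python/PyTorch sketch of that optimizer configuration; the model, dataset, and weight decay are placeholders, not values from the cited paper.

# Sketch of the training configuration mentioned in the excerpt (assumptions marked).
import torch
from torch.utils.data import DataLoader, TensorDataset

world_size = 8                                        # 8 GPUs (from the excerpt)
total_batch_size = 16                                 # total batch size (from the excerpt)
per_gpu_batch_size = total_batch_size // world_size   # = 2 samples per GPU

model = torch.nn.Linear(3 * 224 * 224, 1000)          # placeholder model
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-4,                # initial learning rate (from the excerpt)
                              weight_decay=0.05)      # assumption, not stated in the excerpt

# Placeholder single-process loader; a distributed run would give each of the
# 8 ranks a loader with batch_size=2 so that the global batch size is 16.
dataset = TensorDataset(torch.randn(64, 3 * 224 * 224), torch.randint(0, 1000, (64,)))
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, shuffle=True)

for images, labels in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()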
“…Moreover, feedforward NNs are also part of CNN architectures: they are placed after the initial convolutional layers for the final prediction. Researchers have recently been revisiting simpler feedforward NNs (also called multi-layer perceptrons) for classification tasks (Touvron et al, 2021). We therefore selected a basic feedforward NN model to study vocal tract dynamics.…”
Section: Neural Network-based Classification Model
Citation type: mentioning
Confidence: 99%
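The passage above refers to a basic feedforward NN (multi-layer perceptron) classifier. The following Python/PyTorch sketch shows what such a model might look like; the input dimension, hidden sizes, and number of speech-token classes are assumptions for illustration, not values from the cited study.

# Illustrative feedforward (MLP) classifier; all sizes are assumed.
import torch.nn as nn

def make_mlp_classifier(in_dim=39, hidden=(256, 128), num_classes=10):
    layers, prev = [], in_dim
    for width in hidden:
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, num_classes))   # output logits, one per class
    return nn.Sequential(*layers)

model = make_mlp_classifier()                     # e.g. for 39-dimensional acoustic features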
“…Several studies have shown that NNs with many hidden layers outperform GMM-based models by a large margin on various speech recognition benchmarks (Hinton et al, 2012; Mohamed et al, 2012; Pan et al, 2012). Among the many variants of NN architectures, feedforward NNs are simple yet effective in pattern recognition tasks (Touvron et al, 2021). Moreover, other architectures, mostly derived from feedforward NNs, might relegate vocal tract-induced variabilities, but the effect of these variabilities on speech token classification was the main focus of this article.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%
“…where Norm(·) denotes a normalization such as Layer Normalization [1], and TokenMixer(·) denotes a module whose main role is to communicate information among tokens. In recent works, vision transformer models [31,44,45] and the spatial MLPs in MLP-like models [46,47] implement various kinds of token-mixing mechanisms, which mainly propagate information among tokens and, as in attention, also perform some channel mixing. The second sub-block consists of a two-layer MLP with a nonlinear activation function,…”
Section: Forgery-detection-with-facial-detail Transformer
Citation type: mentioning
Confidence: 99%
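The quoted passage describes a generic block built from two residual sub-blocks: token mixing applied after normalization, then a two-layer MLP applied after normalization. A minimal Python/PyTorch sketch is given below; the choice of multi-head self-attention as the TokenMixer, and all sizes, are illustrative assumptions.

# Sketch of the block structure x = x + TokenMixer(Norm(x)); x = x + MLP(Norm(x)).
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # TokenMixer: here multi-head self-attention (one possible choice)
        self.token_mixer = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Second sub-block: two-layer MLP with a nonlinear activation
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                          # x: (batch, tokens, dim)
        y = self.norm1(x)
        x = x + self.token_mixer(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x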