Alaaeldin El-Nouby scite author profile

We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using heavy data-augmentation and optionally distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We will share our code based on the Timm library and pre-trained models. Preprint. Under review.
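The two alternating sublayers described in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the GELU activation, and the omission of biases and of any normalization or scaling layers are assumptions made for brevity. The input `x` is a list of patches, each a list of channel values.

```python
import math


def gelu(v):
    # tanh approximation of GELU; the choice of activation is an assumption
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def matmul(a, b):
    # plain matrix product of two nested lists
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    # element-wise sum, used for the residual connections
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def resmlp_block(x, w_patch, w1, w2):
    """x: num_patches x channels; w_patch: num_patches x num_patches;
    w1: channels x hidden; w2: hidden x channels (all hypothetical names)."""
    # (i) cross-patch linear layer: the same matrix mixes patches for every channel
    x = add(x, matmul(w_patch, x))
    # (ii) two-layer feed-forward network: channels interact, applied per patch
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return add(x, matmul(hidden, w2))
```

With `w_patch` set to a matrix that swaps two patches and the feed-forward weights set to zero, the block returns each patch plus its swapped counterpart, which shows that sublayer (i) moves information across patches while leaving channels separate.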

LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference

Graham¹,

El-Nouby²,

Touvron³

et al. 2021

394

185

ResMLP: Feedforward Networks for Image Classification With Data-Efficient Training

Touvron¹,

Bojanowski²,

Caron³

et al. 2022

IEEE Trans. Pattern Anal. Mach. Intell.

291

180

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction

et al. 2019

Conditional text-to-image generation is an active area of research, with many possible applications. Existing research has primarily focused on generating a single image from available conditioning information in one step. One practical extension beyond one-step generation is a system that generates an image iteratively, conditioned on ongoing linguistic input or feedback. This is significantly more challenging than one-step generation tasks, as such a system must understand the contents of its generated images with respect to the feedback history, the current feedback, as well as the interactions among concepts present in the feedback history. In this work, we present a recurrent image generation model which takes into account both the generated output up to the current step as well as all past instructions for generation. We show that our model is able to generate the background, add new objects, and apply simple transformations to existing objects. We believe our approach is an important step toward interactive generation. Code and data are available at: https://www.microsoft.com/en-us/research/project/generative-neural-visual-artist-geneva/.

LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference

Graham¹,

El-Nouby²,

Touvron³

et al. 2021

Preprint

Are Large-scale Datasets Necessary for Self-Supervised Pre-training?

El-Nouby¹,

Izacard²,

Touvron³

et al. 2021

Preprint

Three things everyone should know about Vision Transformers

Touvron¹,

Cord²,

El-Nouby³

et al. 2022

Preprint

Augmenting Convolutional networks with attention-based aggregation

Touvron¹,

Cord²,

El-Nouby³

et al. 2021

Preprint

We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling by an attention-based aggregation layer akin to a single transformer block, that weights how the patches are involved in the classification decision. We plug this learned aggregation layer with a simplistic patch-based convolutional network parametrized by 2 parameters (width and depth). In contrast with a pyramidal design, this architecture family maintains the input patch resolution across all the layers. It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption, as shown by our experiments on various computer vision tasks: object classification, image segmentation and detection.
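The idea of replacing average pooling with attention-based aggregation can be sketched as follows. This is a simplified illustration, not the paper's layer: it uses a single query vector with no learned key or value projections, and the names `attention_pool`, `patches`, and `query` are hypothetical.

```python
import math


def attention_pool(patches, query):
    """patches: list of N feature vectors of size D; query: one vector of size D.
    Returns the pooled vector and the per-patch attention weights."""
    d = len(query)
    # scaled dot-product score between the query and each patch
    scores = [sum(q * p for q, p in zip(query, patch)) / math.sqrt(d) for patch in patches]
    # numerically stable softmax over patches
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted sum of patch features replaces the uniform average
    pooled = [sum(w * patch[j] for w, patch in zip(weights, patches)) for j in range(d)]
    return pooled, weights
```

With an all-zero query every patch receives the same weight, so the result reduces to plain average pooling; a non-zero query shifts weight toward the patches it aligns with, and the weights themselves indicate how much each patch contributes to the pooled feature.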
