Joan Puigcerver scite author profile

Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI (Pathways Language and Image model) extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question answering, scene-text understanding), while retaining a simple, modular, and scalable design.

On Robustness and Transferability of Convolutional Neural Networks

Djolonga

Yung

Tschannen

et al. 2021

ICDAR2015 Competition on Keyword Spotting for Handwritten Documents

Puigcerver

Toselli

Vidal

2015

Preparatory KWS Experiments for Large-Scale Indexing of a Vast Medieval Manuscript Collection in the HIMANIS Project

Bluche

Hamel

Kermorvant

et al. 2017

Scaling Vision with Sparse Mixture of Experts

Riquelme

Puigcerver

Mustafa

et al. 2021

Preprint

Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade off performance and compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.
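The contrast the abstract draws between "dense" and sparse processing can be illustrated with a generic top-k routing sketch. This is not the V-MoE implementation: the function names (`top_k_route`, `sparse_moe`), the toy experts, and the router weights are all hypothetical, and the choice to use the softmax probabilities directly as gate weights without renormalizing is an assumption made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in chosen]


def sparse_moe(token, router_weights, experts, k=2):
    """Run only the k selected experts on the token and sum their gated outputs."""
    # One router logit per expert: dot product of the token with that expert's router vector.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    routes = top_k_route(logits, k)
    out = [0.0] * len(token)
    for idx, gate in routes:
        expert_out = experts[idx](token)  # experts outside the top k are never evaluated
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in routes]


# Hypothetical toy setup: four "experts" that each scale the token by a constant.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, used = sparse_moe([2.0, 1.0], router_weights, experts, k=2)
print("experts used:", used)
```

With k smaller than the number of experts, each input touches only a fraction of the layer's parameters, which is the sense in which such a layer is sparse rather than dense.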

Joan Puigcerver

Big Transfer (BiT): General Visual Representation Learning

Are Multidimensional Recurrent Layers Really Necessary for Handwritten Text Recognition?

PaLI: A Jointly-Scaled Multilingual Language-Image Model
