2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022
DOI: 10.1109/cvpr52688.2022.01759
LiT: Zero-Shot Transfer with Locked-image text Tuning

Cited by 171 publications (135 citation statements)
References 20 publications
“…The high-level idea is to learn a shared embedding space for both image and text, such that paired images and texts stay close to each other, while unpaired ones are distant from each other. Follow-up work (Pham et al., 2021; Zhai et al., 2022b) studies the impact of training data and batch size in contrastive learning. They observed that additional high-quality data (Pham et al., 2021) or a pretrained vision model (Zhai et al., 2022b) can lead to better vision-language models, and that a large batch size is generally beneficial to contrastive learning.…”
Section: Related Work (mentioning)
confidence: 99%
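The shared-embedding idea the statement describes can be sketched as a symmetric image–text contrastive (InfoNCE) loss: paired rows sit on the diagonal of a similarity matrix and act as positives, while every other pairing in the batch is a negative. This is a minimal numpy sketch under that assumption; the function names and the temperature value are illustrative, not taken from any specific paper's code.

```python
import numpy as np

def l2_normalize(x):
    """Project each row embedding onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss.

    The i-th image and i-th text form a positive pair; all other
    pairings in the batch serve as negatives.
    """
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()                # positives on the diagonal

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

A quick sanity check: embeddings that are identical for each pair score a much lower loss than embeddings matched at random, which is exactly the "paired close, unpaired distant" behavior the statement describes. A large batch helps because each example then contrasts against more negatives in the same matrix.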
“…Follow-up work (Pham et al., 2021; Zhai et al., 2022b) studies the impact of training data and batch size in contrastive learning. They observed that additional high-quality data (Pham et al., 2021) or a pretrained vision model (Zhai et al., 2022b) can lead to better vision-language models, and that a large batch size is generally beneficial to contrastive learning. Furthermore, Zhai et al. (2022b) show that with a pretrained and locked vision model, one needs to train only a paired text encoder to obtain good language embeddings.…”
Section: Related Work (mentioning)
confidence: 99%
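The locked-image setup described above can be sketched as follows: features from a frozen, pretrained image tower are computed once, and only the text-side parameters receive gradient updates. Everything here is a toy illustration, not the paper's implementation: the linear "towers," the tiny dimensions, the temperature, and the finite-difference gradients (used so the sketch needs only numpy) are all assumptions.

```python
import numpy as np

def clip_loss(img_feats, txt_feats, temperature=1.0):
    """Symmetric contrastive loss over L2-normalized features."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16))       # toy raw inputs
captions = rng.normal(size=(8, 16))
W_img = rng.normal(size=(16, 4))        # stands in for a frozen, pretrained image tower
W_txt = 0.1 * rng.normal(size=(16, 4))  # the only trainable parameters

# The image tower is locked: its features are computed once and never updated.
img_feats = images @ W_img

loss_before = clip_loss(img_feats, captions @ W_txt)
for _ in range(100):
    # finite-difference gradient w.r.t. W_txt only; the locked image tower
    # receives no updates at any point
    eps, grad = 1e-4, np.zeros_like(W_txt)
    base = clip_loss(img_feats, captions @ W_txt)
    for i in range(W_txt.shape[0]):
        for j in range(W_txt.shape[1]):
            W_txt[i, j] += eps
            grad[i, j] = (clip_loss(img_feats, captions @ W_txt) - base) / eps
            W_txt[i, j] -= eps
    W_txt -= 0.3 * grad
loss_after = clip_loss(img_feats, captions @ W_txt)
```

The design point the citation highlights is visible even in this toy: because `img_feats` is precomputed from a frozen tower, the contrastive objective only has to teach the text side to "read out" the existing image representation, which is cheaper than training both towers from scratch.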