2022
DOI: 10.48550/arxiv.2210.08402
Preprint

LAION-5B: An open large-scale dataset for training next generation image-text models

Abstract: Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on expensive accurate labels used in standard vision unimodal supervised learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN,…
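As a concrete illustration of the zero-shot classification the abstract describes, here is a minimal sketch using the open_clip library; the model name, pretrained tag, input file, and prompt set are illustrative assumptions rather than details taken from the paper. An image is scored against a set of text prompts by cosine similarity in the shared embedding space.

# Minimal zero-shot classification sketch with open_clip.
# The checkpoint tag and input file are illustrative assumptions; any
# LAION-trained checkpoint exposed by open_clip could be substituted.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = tokenizer(class_prompts)
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical input file

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each prompt, softmaxed into class probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_prompts, probs[0].tolist())))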

Cited by 69 publications (97 citation statements)
References 51 publications
“…See Figure 1 for highlighted results. For example, our ViT-L/14 checkpoint achieves 78.5% zero-shot accuracy on ImageNet [16], surpassing all public checkpoints from CLIP [57] and OpenCLIP [29], including those with larger model sizes and those trained on the much larger LAION-2B [59]. The new customized visual models also demonstrate higher few/full-shot performance than their original generic model counterparts.…”
Section: Introduction
Mentioning, confidence: 94%
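The comparison against public CLIP and OpenCLIP checkpoints in the quote is possible because OpenCLIP releases its LAION-trained weights; a small sketch of how such checkpoints might be enumerated, assuming the open_clip_torch package (the model-name and substring filters are illustrative):

# Enumerate publicly released OpenCLIP checkpoints, e.g. to find the
# LAION-trained ViT-L/14 weights referenced in the quoted comparison.
import open_clip

for model_name, pretrained_tag in open_clip.list_pretrained():
    if model_name == "ViT-L-14" and "laion" in pretrained_tag:
        print(model_name, pretrained_tag)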
“…The acquired knowledge typically contains richer information about the concept: relevant images that never appear in the downstream training and evaluation sets, and richer text descriptions of concept semantics. Such multi-modal knowledge sources are generally available on the web, and some, like LAION [59,60], are further open-sourced. They cover a variety of domains, making it possible to develop customized visual models for task-level transfer.…”
Section: Introduction
Mentioning, confidence: 99%
“…Specifically, image features are extracted from the ViT-B/32 and ViT-L/14 visual models of both the standard CLIP architecture (accessed on 15 January 2023) [41] and the open-source implementation of CLIP (i.e., OpenCLIP, accessed on 15 January 2023) [62], which has been trained with a post-ensemble method for improving robustness to out-of-distribution samples. The OpenCLIP architectures employed in the experiments were trained on LAION-2B, composed of 2 billion image–text pairs obtained by filtering the English pairs of the LAION-5B dataset [63].…”
Section: Experimental Evaluation
Mentioning, confidence: 99%
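A rough sketch of the feature-extraction setup the quote describes, pulling image embeddings from both the original CLIP ViT models and an OpenCLIP model trained on LAION-2B; the checkpoint tags and input file are assumptions for illustration, not the cited paper's exact configuration.

# Extract image features from the original CLIP and from an OpenCLIP
# model trained on LAION-2B (the English subset of LAION-5B).
import torch
from PIL import Image
import clip        # OpenAI CLIP package
import open_clip   # open_clip_torch package

device = "cuda" if torch.cuda.is_available() else "cpu"
image = Image.open("example.jpg")  # hypothetical input file

# Original CLIP ViT-B/32.
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    clip_feats = clip_model.encode_image(clip_preprocess(image).unsqueeze(0).to(device))

# OpenCLIP ViT-B-32 with a LAION-2B-trained checkpoint (tag is an assumption).
oc_model, _, oc_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k", device=device
)
with torch.no_grad():
    oc_feats = oc_model.encode_image(oc_preprocess(image).unsqueeze(0).to(device))

print(clip_feats.shape, oc_feats.shape)  # e.g. 512-dimensional embeddings from each model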
“…Language Supervised Visual Pre-training learns visual representations from image-text pairs by solving generative [16,56] or discriminative [72] pretext tasks. Recently, benefiting from modern scalable networks [21,41,42] and publicly available image-text datasets [8,17,57-59], CLIP [47] and ALIGN [35] have unveiled the tremendous transferability and scalability of this paradigm. The core technique of CLIP is aligning the vision and language modalities in a joint embedding space via a global-representation contrastive objective.…”
Section: Related Work
Mentioning, confidence: 99%
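The global-representation contrastive objective mentioned in the quote is the symmetric InfoNCE loss popularized by CLIP; a self-contained sketch of that loss follows, with random tensors standing in for the image and text encoder outputs.

# CLIP-style symmetric contrastive (InfoNCE) loss: matched image-text pairs
# on the diagonal are positives, all other pairs in the batch are negatives.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize both modalities so dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix, scaled by a temperature.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())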