2022
DOI: 10.48550/arxiv.2210.08402
Preprint

LAION-5B: An open large-scale dataset for training next generation image-text models

Abstract: Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of training on large amounts of noisy image-text data, without relying on expensive accurate labels used in standard vision unimodal supervised learning. The resulting models showed capabilities of strong text-guided image generation and transfer to downstream tasks, while performing remarkably at zero-shot classification with noteworthy out-of-distribution robustness. Since then, large-scale language-vision models like ALIGN,…
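As a concrete illustration of the zero-shot classification the abstract describes, here is a minimal sketch using the open_clip library; the model name, pretrained tag, input file, and prompt set are illustrative assumptions rather than details taken from the paper. An image is scored against a set of text prompts by cosine similarity in the shared embedding space.

# Minimal zero-shot classification sketch with open_clip.
# The checkpoint tag and input file are illustrative assumptions; any
# LAION-trained checkpoint exposed by open_clip could be substituted.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = tokenizer(class_prompts)
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical input file

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each prompt, softmaxed into class probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_prompts, probs[0].tolist())))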

Cited by 69 publications (97 citation statements)
References 51 publications
“…See Figure 1 for highlighted results. For example, our ViT-L/14 checkpoint achieves 78.5% zero-shot accuracy on ImageNet [16], surpassing all public checkpoints from CLIP [57] and OpenCLIP [29], including those with larger model sizes and those trained on the much larger LAION-2B [59]. The new customized visual models also demonstrate higher few/full-shot performance than their original generic model counterparts.…”
Section: Introduction
Mentioning, confidence: 94%
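The comparison against public CLIP and OpenCLIP checkpoints in the quote is possible because OpenCLIP releases its LAION-trained weights; a small sketch of how such checkpoints might be enumerated, assuming the open_clip_torch package (the model-name and substring filters are illustrative):

# Enumerate publicly released OpenCLIP checkpoints, e.g. to find the
# LAION-trained ViT-L/14 weights referenced in the quoted comparison.
import open_clip

for model_name, pretrained_tag in open_clip.list_pretrained():
    if model_name == "ViT-L-14" and "laion" in pretrained_tag:
        print(model_name, pretrained_tag)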
“…The acquired knowledge typically contains richer information about the concept: relevant images that never appear in the downstream training and evaluation sets, and richer text descriptions of concept semantics. Such multi-modal knowledge sources are generally available on the web, and some, like LAION [59,60], are further open-sourced. They cover a variety of domains, making it possible to develop customized visual models for task-level transfer.…”
Section: Introduction
Mentioning, confidence: 99%
“…Specifically, image features are extracted from the ViT-B/32 and ViT-L/14 visual models of both the standard CLIP architecture (accessed on 15 January 2023) [41] and the open-source implementation of CLIP (i.e., OpenCLIP, accessed on 15 January 2023) [62], which has been trained with a post-ensemble method for improving robustness to out-of-distribution samples. The OpenCLIP architectures employed in the experiments were trained on LAION-2B, composed of 2 billion image–text pairs obtained by filtering the English pairs of the LAION-5B dataset [63].…”
Section: Experimental Evaluation
Mentioning, confidence: 99%
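A rough sketch of the feature-extraction setup the quote describes, pulling image embeddings from both the original CLIP ViT models and an OpenCLIP model trained on LAION-2B; the checkpoint tags and input file are assumptions for illustration, not the cited paper's exact configuration.

# Extract image features from the original CLIP and from an OpenCLIP
# model trained on LAION-2B (the English subset of LAION-5B).
import torch
from PIL import Image
import clip        # OpenAI CLIP package
import open_clip   # open_clip_torch package

device = "cuda" if torch.cuda.is_available() else "cpu"
image = Image.open("example.jpg")  # hypothetical input file

# Original CLIP ViT-B/32.
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    clip_feats = clip_model.encode_image(clip_preprocess(image).unsqueeze(0).to(device))

# OpenCLIP ViT-B-32 with a LAION-2B-trained checkpoint (tag is an assumption).
oc_model, _, oc_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k", device=device
)
with torch.no_grad():
    oc_feats = oc_model.encode_image(oc_preprocess(image).unsqueeze(0).to(device))

print(clip_feats.shape, oc_feats.shape)  # e.g. 512-dimensional embeddings from each model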
“…Language Supervised Visual Pre-training learns visual representations from image-text pairs by solving generative [16,56] or discriminative [72] pretext tasks. Recently, benefiting from modern scalable networks [21,41,42] and publicly available image-text datasets [8,17,57-59], CLIP [47] and ALIGN [35] have unveiled the tremendous transferability and scalability of this paradigm. The core technique of CLIP is aligning the vision and language modalities in a joint embedding space via a global-representation contrastive objective.…”
Section: Related Work
Mentioning, confidence: 99%
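The global-representation contrastive objective mentioned in the quote is the symmetric InfoNCE loss popularized by CLIP; a self-contained sketch of that loss follows, with random tensors standing in for the image and text encoder outputs.

# CLIP-style symmetric contrastive (InfoNCE) loss: matched image-text pairs
# on the diagonal are positives, all other pairs in the batch are negatives.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize both modalities so dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix, scaled by a temperature.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())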