2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00356
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Cited by 324 publications (225 citation statements)
References 25 publications
“…All models show very low accuracy across all skills with the zero-shot setting. This is because of the domain gap (e.g., background color, object textures) between PAINTSKILLS and pretraining images [11,36,50]. We provide the zero-shot image generation samples in the appendix.…”
Section: Visual Reasoning Skill Results
Mentioning confidence: 99%
“…A VQGAN [18] pretrained on ImageNet [16] is used as the dVAE. The transformer is trained on 15M image-text pairs from Conceptual Captions [11,50]. ruDALL-E-XL (Malevich).…”
Section: Evaluated Models
Mentioning confidence: 99%
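The setup quoted above (a pretrained VQGAN serving as the discrete image tokenizer, i.e. the dVAE, plus an autoregressive transformer trained on Conceptual Captions image-text pairs) follows the DALL-E-style recipe. The sketch below is an illustrative, self-contained PyTorch toy version of that recipe, not the evaluated models' code: the toy quantizer, all module sizes, and the dummy data are assumptions made for the example.

```python
# Minimal sketch of a DALL-E-style text-to-image recipe: a VQGAN-like image
# tokenizer produces discrete image tokens, and a transformer is trained to
# predict those tokens autoregressively from text tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQTokenizer(nn.Module):
    """Stand-in for a pretrained VQGAN: maps images to a grid of codebook indices."""
    def __init__(self, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Conv2d(3, 64, kernel_size=16, stride=16)  # 256x256 -> 16x16
        self.codebook = nn.Embedding(codebook_size, 64)

    @torch.no_grad()
    def encode(self, images):                       # images: (B, 3, 256, 256)
        z = self.encoder(images)                    # (B, 64, 16, 16)
        z = z.flatten(2).transpose(1, 2)            # (B, 256, 64)
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        return torch.cdist(z, book).argmin(-1)      # (B, 256) discrete image tokens

class TextToImageTransformer(nn.Module):
    """Autoregressive decoder trained to predict image tokens given text tokens."""
    def __init__(self, text_vocab=50_000, image_vocab=1024, d=512, n_text=64, n_img=256):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d)
        self.img_emb = nn.Embedding(image_vocab, d)
        self.pos = nn.Parameter(torch.zeros(n_text + n_img, d))
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d, image_vocab)

    def forward(self, text_ids, image_ids):
        x = torch.cat([self.text_emb(text_ids), self.img_emb(image_ids)], dim=1)
        x = x + self.pos[: x.size(1)]
        L = x.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
        h = self.blocks(x, mask=causal)
        # Predict each image token from all tokens that precede it.
        logits = self.head(h[:, text_ids.size(1) - 1 : -1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), image_ids.reshape(-1))

# One training step on a dummy image-text pair (CC3M/CC12M-style data assumed).
tokenizer, model = ToyVQTokenizer(), TextToImageTransformer()
text_ids = torch.randint(0, 50_000, (2, 64))
image_ids = tokenizer.encode(torch.rand(2, 3, 256, 256))
loss = model(text_ids, image_ids)
loss.backward()
```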
“…On the other hand, pre-training models on online collected data (such as alt-texts from the HTML pages) has shown promising results. CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021) and YFCC100M (Thomee et al., 2016) have millions of image-text pairs in English generated by an online data collection pipeline including image and text filters, as well as text transformations. VLP models on these datasets have shown to be effective in multiple downstream tasks.…”
Section: Vision-language Datasets
Mentioning confidence: 99%
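The "image and text filters, as well as text transformations" mentioned in this excerpt are the core of alt-text collection pipelines such as CC3M and CC12M. The Python sketch below shows the kind of filtering and hypernymization steps such a pipeline might include; the thresholds, regexes, and hypernym map are invented for illustration and are not the published CC3M/CC12M rules.

```python
# Illustrative image/text filtering and text transformation for web alt-text pairs.
import re

MIN_SIDE, MAX_ASPECT = 200, 2.0   # assumed thresholds, for illustration only
MAX_WORDS = 50

def image_ok(width: int, height: int) -> bool:
    """Drop tiny images and extreme aspect ratios."""
    if min(width, height) < MIN_SIDE:
        return False
    return max(width, height) / min(width, height) <= MAX_ASPECT

def text_ok(alt_text: str) -> bool:
    """Drop empty, overly long, or boilerplate-looking alt-texts."""
    words = alt_text.split()
    if not words or len(words) > MAX_WORDS:
        return False
    if re.search(r"(click here|\.jpg|\.png|https?://)", alt_text, re.I):
        return False
    return True

def transform_text(alt_text: str) -> str:
    """Toy stand-in for text transformations such as hypernymizing rare named entities."""
    hypernyms = {"harrison ford": "actor", "eiffel tower": "tower"}  # illustrative map
    out = alt_text.lower()
    for name, hyper in hypernyms.items():
        out = out.replace(name, hyper)
    return out

def build_pairs(raw_pairs):
    """raw_pairs: iterable of (image_url, width, height, alt_text) tuples."""
    for url, w, h, alt in raw_pairs:
        if image_ok(w, h) and text_ok(alt):
            yield url, transform_text(alt)
```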
“…In this way, we collect a total of 166 million raw <image, text> pairs. Then following common practices (Sharma et al., 2018; Changpinyo et al., 2021; …), we apply a series of filtering strategies described in the below section to construct the final Wukong dataset. Figure 2 shows some samples within our dataset.…”
Section: Dataset Collection
Mentioning confidence: 99%
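For completeness, here is a tiny usage example of the filter sketch above, mirroring the "collect raw pairs, then filter" flow this Wukong excerpt describes; the records and counts are placeholders, not the actual 166M-pair statistics.

```python
# Assumes build_pairs from the filtering sketch above is in scope.
raw_pairs = [
    ("https://example.com/a.jpg", 640, 480, "A dog catching a frisbee in the park"),
    ("https://example.com/b.png", 120, 90, "click here for more photos"),  # rejected
]

kept = list(build_pairs(raw_pairs))
print(f"kept {len(kept)} of {len(raw_pairs)} raw pairs")
# kept 1 of 2 raw pairs
```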