2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00356
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Cited by 324 publications (225 citation statements)
References 25 publications
“…All models show very low accuracy across all skills with the zero-shot setting. This is because of the domain gap (e.g., background color, object textures) between PAINTSKILLS and pretraining images [11,36,50]. We provide the zero-shot image generation samples in the appendix.…”
Section: Visual Reasoning Skill Results
Mentioning confidence: 99%
“…A VQGAN [18] pretrained on ImageNet [16] is used as the dVAE. The transformer is trained on 15M image-text pairs from Conceptual Captions [11,50]. ruDALL-E-XL (Malevich).…”
Section: Evaluated Models
Mentioning confidence: 99%
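The setup quoted above (a pretrained VQGAN serving as the discrete image tokenizer, i.e. the dVAE, plus an autoregressive transformer trained on Conceptual Captions image-text pairs) follows the DALL-E-style recipe. The sketch below is an illustrative, self-contained PyTorch toy version of that recipe, not the evaluated models' code: the toy quantizer, all module sizes, and the dummy data are assumptions made for the example.

```python
# Minimal sketch of a DALL-E-style text-to-image recipe: a VQGAN-like image
# tokenizer produces discrete image tokens, and a transformer is trained to
# predict those tokens autoregressively from text tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQTokenizer(nn.Module):
    """Stand-in for a pretrained VQGAN: maps images to a grid of codebook indices."""
    def __init__(self, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Conv2d(3, 64, kernel_size=16, stride=16)  # 256x256 -> 16x16
        self.codebook = nn.Embedding(codebook_size, 64)

    @torch.no_grad()
    def encode(self, images):                       # images: (B, 3, 256, 256)
        z = self.encoder(images)                    # (B, 64, 16, 16)
        z = z.flatten(2).transpose(1, 2)            # (B, 256, 64)
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        return torch.cdist(z, book).argmin(-1)      # (B, 256) discrete image tokens

class TextToImageTransformer(nn.Module):
    """Autoregressive decoder trained to predict image tokens given text tokens."""
    def __init__(self, text_vocab=50_000, image_vocab=1024, d=512, n_text=64, n_img=256):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d)
        self.img_emb = nn.Embedding(image_vocab, d)
        self.pos = nn.Parameter(torch.zeros(n_text + n_img, d))
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d, image_vocab)

    def forward(self, text_ids, image_ids):
        x = torch.cat([self.text_emb(text_ids), self.img_emb(image_ids)], dim=1)
        x = x + self.pos[: x.size(1)]
        L = x.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
        h = self.blocks(x, mask=causal)
        # Predict each image token from all tokens that precede it.
        logits = self.head(h[:, text_ids.size(1) - 1 : -1])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), image_ids.reshape(-1))

# One training step on a dummy image-text pair (CC3M/CC12M-style data assumed).
tokenizer, model = ToyVQTokenizer(), TextToImageTransformer()
text_ids = torch.randint(0, 50_000, (2, 64))
image_ids = tokenizer.encode(torch.rand(2, 3, 256, 256))
loss = model(text_ids, image_ids)
loss.backward()
```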
“…On the other hand, pre-training models on online collected data (such as alt-texts from the HTML pages) has shown promising results. CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021) and YFCC100M (Thomee et al., 2016) have millions of image-text pairs in English generated by an online data collection pipeline including image and text filters, as well as text transformations. VLP models on these datasets have shown to be effective in multiple downstream tasks.…”
Section: Vision-language Datasets
Mentioning confidence: 99%
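The "image and text filters, as well as text transformations" mentioned in this excerpt are the core of alt-text collection pipelines such as CC3M and CC12M. The Python sketch below shows the kind of filtering and hypernymization steps such a pipeline might include; the thresholds, regexes, and hypernym map are invented for illustration and are not the published CC3M/CC12M rules.

```python
# Illustrative image/text filtering and text transformation for web alt-text pairs.
import re

MIN_SIDE, MAX_ASPECT = 200, 2.0   # assumed thresholds, for illustration only
MAX_WORDS = 50

def image_ok(width: int, height: int) -> bool:
    """Drop tiny images and extreme aspect ratios."""
    if min(width, height) < MIN_SIDE:
        return False
    return max(width, height) / min(width, height) <= MAX_ASPECT

def text_ok(alt_text: str) -> bool:
    """Drop empty, overly long, or boilerplate-looking alt-texts."""
    words = alt_text.split()
    if not words or len(words) > MAX_WORDS:
        return False
    if re.search(r"(click here|\.jpg|\.png|https?://)", alt_text, re.I):
        return False
    return True

def transform_text(alt_text: str) -> str:
    """Toy stand-in for text transformations such as hypernymizing rare named entities."""
    hypernyms = {"harrison ford": "actor", "eiffel tower": "tower"}  # illustrative map
    out = alt_text.lower()
    for name, hyper in hypernyms.items():
        out = out.replace(name, hyper)
    return out

def build_pairs(raw_pairs):
    """raw_pairs: iterable of (image_url, width, height, alt_text) tuples."""
    for url, w, h, alt in raw_pairs:
        if image_ok(w, h) and text_ok(alt):
            yield url, transform_text(alt)
```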
“…In this way, we collect a total of 166 million raw <image, text> pairs. Then following common practices (Sharma et al., 2018; Changpinyo et al., 2021; …), we apply a series of filtering strategies described in the below section to construct the final Wukong dataset. Figure 2 shows some samples within our dataset.…”
Section: Dataset Collection
Mentioning confidence: 99%
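For completeness, here is a tiny usage example of the filter sketch above, mirroring the "collect raw pairs, then filter" flow this Wukong excerpt describes; the records and counts are placeholders, not the actual 166M-pair statistics.

```python
# Assumes build_pairs from the filtering sketch above is in scope.
raw_pairs = [
    ("https://example.com/a.jpg", 640, 480, "A dog catching a frisbee in the park"),
    ("https://example.com/b.png", 120, 90, "click here for more photos"),  # rejected
]

kept = list(build_pairs(raw_pairs))
print(f"kept {len(kept)} of {len(raw_pairs)} raw pairs")
# kept 1 of 2 raw pairs
```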