2022
DOI: 10.48550/arxiv.2212.07143
Preprint

Reproducible scaling laws for contrastive language-image learning

Cited by 20 publications (19 citation statements)
References 0 publications
“…These experiments are motivated by our initial use of the Merged-38M pre-trained representation for LVIS val set evaluation, which resulted in unintended use of unlabeled images from the development/test set for MIM pre-training, similar to the issue raised in [60]. [30] also reports that a small percentage of images from IN-1K and its variants, Flickr30K, and COCO were detected in the LAION-400M dataset. This data contamination issue raises concerns about the validity of downstream benchmarks when a large number of unlabeled images are used for pre-training.…”
Section: A3 Data Contamination in MIM Pre-training: A Case Study (mentioning)
confidence: 99%
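
The contamination check discussed in the excerpt above can be approximated with a simple near-duplicate scan. The sketch below is illustrative only: the directory names, hash size, and distance threshold are assumptions, and the cited works may rely on embedding-based duplicate detection rather than perceptual hashing.

```python
# Illustrative sketch: flag evaluation-set images that also appear (near-)verbatim
# in a pre-training corpus, using perceptual hashing. Paths and the distance
# threshold are hypothetical; [30] and [60] may use different detection methods.
from pathlib import Path

import imagehash        # pip install ImageHash
from PIL import Image

HASH_DISTANCE_THRESHOLD = 4   # assumed tolerance for near-duplicates (Hamming)


def phash_dir(directory: str) -> dict:
    """Map perceptual hash -> file path for every JPEG in a directory."""
    return {imagehash.phash(Image.open(p)): p for p in Path(directory).glob("*.jpg")}


def find_contaminated(eval_dir: str, pretrain_dir: str) -> list:
    """Return (eval image, pre-training image) pairs that look like duplicates."""
    pretrain_hashes = phash_dir(pretrain_dir)
    hits = []
    for h, eval_path in phash_dir(eval_dir).items():
        for ph, pre_path in pretrain_hashes.items():
            if h - ph <= HASH_DISTANCE_THRESHOLD:   # ImageHash '-' gives Hamming distance
                hits.append((eval_path, pre_path))
                break
    return hits


if __name__ == "__main__":
    overlaps = find_contaminated("lvis_val/", "pretrain_images/")
    print(f"{len(overlaps)} potentially contaminated evaluation images")
```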
“…Recent research advancements have led to a surge of interest in scaling up vision [81,44,124,17] as well as vision-language [140,123,30,139] representations. These efforts are driven by the belief that increasing the number of parameters, data, and compute budgets will ultimately result in improved performance [63,142,134,93].…”
Section: Introduction (mentioning)
confidence: 99%
“…Using this standard view as a basis, we leverage 8 azimuth angles (0°, 45°, ..., 315°) and 3 elevation angles (-30°, 0°, 30°) to render 24 images. To address the subjective and non-reproducible nature of user studies, we use an automatic expert model [7] trained on LAION-400M [52] for evaluation. Based on these 24 rendered images and the expert model, we propose two automatic evaluation metrics.…”
Section: Benchmarks and Metrics (mentioning)
confidence: 99%
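
As a rough illustration of how a CLIP-based automatic metric over such rendered views could be computed, the sketch below enumerates the 8 × 3 = 24 azimuth/elevation combinations and averages image-text similarity from an OpenCLIP model pretrained on LAION-400M. The model tag, file naming scheme, prompt, and averaging are assumptions, not the metrics defined in the cited work.

```python
# Minimal sketch: score 24 rendered views (8 azimuths x 3 elevations) with an
# OpenCLIP model pretrained on LAION-400M. File names, the model/pretrained
# tags, and the averaging scheme are assumptions for illustration only.
import torch
import open_clip                # pip install open_clip_torch
from PIL import Image

AZIMUTHS = range(0, 360, 45)    # 0°, 45°, ..., 315°  (8 angles)
ELEVATIONS = (-30, 0, 30)       # 3 angles -> 24 views in total

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


def view_score(prompt: str, render_dir: str = "renders") -> float:
    """Average CLIP image-text similarity over the 24 rendered views."""
    text = tokenizer([prompt])
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        sims = []
        for az in AZIMUTHS:
            for el in ELEVATIONS:
                # assumed naming convention for the rendered views
                img = Image.open(f"{render_dir}/az{az}_el{el}.png")
                img_feat = model.encode_image(preprocess(img).unsqueeze(0))
                img_feat /= img_feat.norm(dim=-1, keepdim=True)
                sims.append((img_feat @ text_feat.T).item())
    return sum(sims) / len(sims)
```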
“…Our approach utilized the CLIP architecture, which was initially proposed for the image-to-text task. Two different datasets were used for training in the two public implementations, OpenAI's CLIP [10] and OpenCLIP [30]. The former involved pre-training the model on Imagenet22K, with ViT-L being the best-performing model.…”
Section: The Proposed Approach, A. Model Architecture (mentioning)
confidence: 99%
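
For reference, the two public implementations mentioned in the excerpt can be loaded as in this minimal sketch; the specific checkpoint tags are assumptions and may differ from the ones the citing authors actually used.

```python
# Minimal sketch: loading ViT-L/14 image-text models from the two public
# implementations referenced above. Checkpoint tags are assumptions.
import clip          # pip install git+https://github.com/openai/CLIP.git
import open_clip     # pip install open_clip_torch

# OpenAI's reference implementation [10]
openai_model, openai_preprocess = clip.load("ViT-L/14")

# OpenCLIP [30], here with a LAION-400M checkpoint (assumed tag)
oc_model, _, oc_preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion400m_e32"
)
oc_tokenizer = open_clip.get_tokenizer("ViT-L-14")
```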