2022
DOI: 10.48550/arxiv.2212.07143
Preprint

Reproducible scaling laws for contrastive language-image learning

Cited by 20 publications (19 citation statements)
References 0 publications
“…These experiments are motivated by our initial use of the Merged-38M pre-trained representation for LVIS val set evaluation, which resulted in unintended use of unlabeled images from the development/test set for MIM pre-training, similar to the issue raised in [60]. [30] also reports that a small percentage of images from IN-1K and its variants, Flickr30K, and COCO were detected in the LAION-400M dataset. This data contamination issue raises concerns about the validity of downstream benchmarks when a large number of unlabeled images are used for pre-training.…”
Section: A3 Data Contamination in MIM Pre-training: A Case Study (mentioning)
confidence: 99%
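
The contamination check discussed in the excerpt above can be approximated with a simple near-duplicate scan. The sketch below is illustrative only: the directory names, hash size, and distance threshold are assumptions, and the cited works may rely on embedding-based duplicate detection rather than perceptual hashing.

```python
# Illustrative sketch: flag evaluation-set images that also appear (near-)verbatim
# in a pre-training corpus, using perceptual hashing. Paths and the distance
# threshold are hypothetical; [30] and [60] may use different detection methods.
from pathlib import Path

import imagehash        # pip install ImageHash
from PIL import Image

HASH_DISTANCE_THRESHOLD = 4   # assumed tolerance for near-duplicates (Hamming)


def phash_dir(directory: str) -> dict:
    """Map perceptual hash -> file path for every JPEG in a directory."""
    return {imagehash.phash(Image.open(p)): p for p in Path(directory).glob("*.jpg")}


def find_contaminated(eval_dir: str, pretrain_dir: str) -> list:
    """Return (eval image, pre-training image) pairs that look like duplicates."""
    pretrain_hashes = phash_dir(pretrain_dir)
    hits = []
    for h, eval_path in phash_dir(eval_dir).items():
        for ph, pre_path in pretrain_hashes.items():
            if h - ph <= HASH_DISTANCE_THRESHOLD:   # ImageHash '-' gives Hamming distance
                hits.append((eval_path, pre_path))
                break
    return hits


if __name__ == "__main__":
    overlaps = find_contaminated("lvis_val/", "pretrain_images/")
    print(f"{len(overlaps)} potentially contaminated evaluation images")
```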
“…Recent research advancements have led to a surge of interest in scaling up vision [81,44,124,17] as well as vision-language [140,123,30,139] representations. These efforts are driven by the belief that increasing the number of parameters, data, and compute budgets will ultimately result in improved performance [63,142,134,93].…”
Section: Introduction (mentioning)
confidence: 99%
“…Using this standard view as a basis, we leverage 8 azimuth angles (0°, 45°, ..., 315°) and 3 elevation angles (-30°, 0°, 30°) to render 24 images. To address the subjective and non-reproducible nature of user studies, we use an automatic expert model [7] trained on LAION-400M [52] for evaluation. Based on these 24 rendered images and the expert model, we propose two automatic evaluation metrics.…”
Section: Benchmarks and Metrics (mentioning)
confidence: 99%
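
As a rough illustration of how a CLIP-based automatic metric over such rendered views could be computed, the sketch below enumerates the 8 × 3 = 24 azimuth/elevation combinations and averages image-text similarity from an OpenCLIP model pretrained on LAION-400M. The model tag, file naming scheme, prompt, and averaging are assumptions, not the metrics defined in the cited work.

```python
# Minimal sketch: score 24 rendered views (8 azimuths x 3 elevations) with an
# OpenCLIP model pretrained on LAION-400M. File names, the model/pretrained
# tags, and the averaging scheme are assumptions for illustration only.
import torch
import open_clip                # pip install open_clip_torch
from PIL import Image

AZIMUTHS = range(0, 360, 45)    # 0°, 45°, ..., 315°  (8 angles)
ELEVATIONS = (-30, 0, 30)       # 3 angles -> 24 views in total

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


def view_score(prompt: str, render_dir: str = "renders") -> float:
    """Average CLIP image-text similarity over the 24 rendered views."""
    text = tokenizer([prompt])
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        sims = []
        for az in AZIMUTHS:
            for el in ELEVATIONS:
                # assumed naming convention for the rendered views
                img = Image.open(f"{render_dir}/az{az}_el{el}.png")
                img_feat = model.encode_image(preprocess(img).unsqueeze(0))
                img_feat /= img_feat.norm(dim=-1, keepdim=True)
                sims.append((img_feat @ text_feat.T).item())
    return sum(sims) / len(sims)
```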
“…Our approach utilized the CLIP architecture, which was initially proposed for the image-to-text task. Two different datasets were used for training in the two public implementations, OpenAI's CLIP [10] and OpenCLIP [30]. The former involved pre-training the model on Imagenet22K, with ViT-L being the best-performing model.…”
Section: The Proposed Approach, A. Model Architecture (mentioning)
confidence: 99%
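
For reference, the two public implementations mentioned in the excerpt can be loaded as in this minimal sketch; the specific checkpoint tags are assumptions and may differ from the ones the citing authors actually used.

```python
# Minimal sketch: loading ViT-L/14 image-text models from the two public
# implementations referenced above. Checkpoint tags are assumptions.
import clip          # pip install git+https://github.com/openai/CLIP.git
import open_clip     # pip install open_clip_torch

# OpenAI's reference implementation [10]
openai_model, openai_preprocess = clip.load("ViT-L/14")

# OpenCLIP [30], here with a LAION-400M checkpoint (assumed tag)
oc_model, _, oc_preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion400m_e32"
)
oc_tokenizer = open_clip.get_tokenizer("ViT-L-14")
```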