CoCa: Contrastive Captioners are Image-Text Foundation Models
Preprint, 2022
DOI: 10.48550/arxiv.2205.01917


Cited by 110 publications (168 citation statements)
References 0 publications

“…Language models are trained on text only corpus significantly larger than paired image-text data, thus being exposed to a very rich and wide distribution of text. These models are also generally much larger than text encoders in current image-text models [49,31,80] (e.g. PaLM [11] has 540B parameters, while CoCa [80] has a ≈ 1B parameter text encoder).…”
Section: Pretrained Text Encoders
confidence: 99%
“…ImageNet-M example selection method. The ViT-3B model made 155 "major" mistakes, for which we analyzed whether each example was labeled correctly for three additional models: (1) the Greedy Soups model, (2) a model pre-trained on Instagram data but fine-tuned on ImageNet that achieves 85.4% top-1 [19], and (3) A zero-shot evaluation [28,15,27] using a CoCa [42] model pretrained on JFT and noisy image-text data. In order to maximize prediction diversity, we purposefully selected models with varying pre-training data and training methodologies, including a zero-shot model that does not see ImageNet image-label associations directly [10].…”
Section: ImageNet-M: A "Major Mistakes" Evaluation Split
confidence: 99%
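
The zero-shot evaluation mentioned in this statement is not spelled out here; the sketch below shows how such an evaluation is commonly run with a contrastive image-text model like CoCa's contrastive head. The `encode_text` helper and the prompt template are assumptions for illustration, not the actual CoCa API or the prompts used in the cited work.

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, encode_text,
                       template="a photo of a {}."):
    """Pick the class whose prompted text embedding is closest to the image.

    image_emb   : 1-D L2-normalized image embedding (np.ndarray)
    class_names : list of label strings, e.g. ImageNet class names
    encode_text : assumed callable mapping a list of strings to an (N, D)
                  array of L2-normalized text embeddings
    """
    prompts = [template.format(name) for name in class_names]
    text_embs = encode_text(prompts)      # (N, D), assumed normalized
    sims = text_embs @ image_emb          # cosine similarity per class
    return class_names[int(np.argmax(sims))]
```

In this setup the model never sees ImageNet image-label associations directly: the labels only enter through the text prompts at evaluation time.
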
“…How do models that were not used to select this dataset perform? We evaluate the suite of 70 models from Shankar et al [31] on this dataset, in addition to four recent top models not directly used to help filter the ImageNet-M set: a ViT-G/14 model [44] (90.5% top-1), a BASIC model [27] fine-tuned on ImageNet (90.7% top-1), an ALIGN model [15] fine-tuned on ImageNet (88.1% top-1), and a CoCa model [42] fine-tuned on ImageNet (91.0% top-1). The plot shown here shows that most models as far back as AlexNet through ResNets get between 10-25 examples correct, but recent high accuracy models such as ViT-G/14, BASIC-FT, and CoCa-FT are starting to solve more of these 'major' mistakes: CoCa-FT gets 42 of the 68 examples correct.…”
Section: ImageNet-M: A "Major Mistakes" Evaluation Split
confidence: 99%
“…On the other hand, label supervision offers to learn more targeted visual representations that are label-oriented and can cover rare categories. To gain the complementary advantages of both kinds of supervision for contrastive image-caption pre-training, recent works [43,46] have proposed to convert class labels into a sentence with pre-defined templates called prompts. However, a naive unification of the real caption and the prompt sentences could lead to a complication in learning, as the distribution shift in text may not be handled properly in the language encoder.…”
confidence: 99%
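
The label-to-sentence conversion described in this statement, turning class labels into caption-like text via pre-defined templates so that label-supervised and caption-supervised data share one text format, can be illustrated with a small sketch. The template strings and the `make_text` helper are illustrative assumptions, not the exact prompts used by the cited works [43,46].

```python
import random

# Illustrative prompt templates; the cited works use their own prompt sets.
TEMPLATES = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
    "an image showing a {}.",
]

def make_text(sample):
    """Return the text paired with an image for contrastive pre-training.

    If the sample carries a real caption (image-text pair), use it as-is;
    otherwise convert its class label into a prompt sentence, so labeled
    and captioned data feed the same text encoder in the same format.
    """
    if sample.get("caption"):
        return sample["caption"]
    return random.choice(TEMPLATES).format(sample["label"])

# Example: {"label": "golden retriever"} -> "a photo of a golden retriever."
```

The distribution-shift concern raised in the quote arises exactly here: template sentences like these are far more uniform than real captions, which the language encoder may not handle well without further care.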