CoCa: Contrastive Captioners are Image-Text Foundation Models
Preprint, 2022
DOI: 10.48550/arxiv.2205.01917


Cited by 110 publications (168 citation statements)
References 0 publications

“…Language models are trained on text only corpus significantly larger than paired image-text data, thus being exposed to a very rich and wide distribution of text. These models are also generally much larger than text encoders in current image-text models [49,31,80] (e.g. PaLM [11] has 540B parameters, while CoCa [80] has a ≈ 1B parameter text encoder).…”
Section: Pretrained Text Encoders
confidence: 99%
“…ImageNet-M example selection method. The ViT-3B model made 155 "major" mistakes, for which we analyzed whether each example was labeled correctly for three additional models: (1) the Greedy Soups model, (2) a model pre-trained on Instagram data but fine-tuned on ImageNet that achieves 85.4% top-1 [19], and (3) A zero-shot evaluation [28,15,27] using a CoCa [42] model pretrained on JFT and noisy image-text data. In order to maximize prediction diversity, we purposefully selected models with varying pre-training data and training methodologies, including a zero-shot model that does not see ImageNet image-label associations directly [10].…”
Section: ImageNet-M: A "Major Mistakes" Evaluation Split
confidence: 99%
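
The zero-shot evaluation mentioned in this statement is not spelled out here; the sketch below shows how such an evaluation is commonly run with a contrastive image-text model like CoCa's contrastive head. The `encode_text` helper and the prompt template are assumptions for illustration, not the actual CoCa API or the prompts used in the cited work.

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, encode_text,
                       template="a photo of a {}."):
    """Pick the class whose prompted text embedding is closest to the image.

    image_emb   : 1-D L2-normalized image embedding (np.ndarray)
    class_names : list of label strings, e.g. ImageNet class names
    encode_text : assumed callable mapping a list of strings to an (N, D)
                  array of L2-normalized text embeddings
    """
    prompts = [template.format(name) for name in class_names]
    text_embs = encode_text(prompts)      # (N, D), assumed normalized
    sims = text_embs @ image_emb          # cosine similarity per class
    return class_names[int(np.argmax(sims))]
```

In this setup the model never sees ImageNet image-label associations directly: the labels only enter through the text prompts at evaluation time.
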
“…How do models that were not used to select this dataset perform? We evaluate the suite of 70 models from Shankar et al [31] on this dataset, in addition to four recent top models not directly used to help filter the ImageNet-M set: a ViT-G/14 model [44] (90.5% top-1), a BASIC model [27] fine-tuned on ImageNet (90.7% top-1), an ALIGN model [15] fine-tuned on ImageNet (88.1% top-1), and a CoCa model [42] fine-tuned on ImageNet (91.0% top-1). The plot shown here shows that most models as far back as AlexNet through ResNets get between 10-25 examples correct, but recent high accuracy models such as ViT-G/14, BASIC-FT, and CoCa-FT are starting to solve more of these 'major' mistakes: CoCa-FT gets 42 of the 68 examples correct.…”
Section: ImageNet-M: A "Major Mistakes" Evaluation Split
confidence: 99%
“…On the other hand, label supervision offers to learn more targeted visual representations that are label-oriented and can cover rare categories. To gain the complementary advantages of both kinds of supervision for contrastive image-caption pre-training, recent works [43,46] have proposed to convert class labels into a sentence with pre-defined templates called prompts. However, a naive unification of the real caption and the prompt sentences could lead to a complication in learning, as the distribution shift in text may not be handled properly in the language encoder.…”
confidence: 99%
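
The label-to-sentence conversion described in this statement, turning class labels into caption-like text via pre-defined templates so that label-supervised and caption-supervised data share one text format, can be illustrated with a small sketch. The template strings and the `make_text` helper are illustrative assumptions, not the exact prompts used by the cited works [43,46].

```python
import random

# Illustrative prompt templates; the cited works use their own prompt sets.
TEMPLATES = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
    "an image showing a {}.",
]

def make_text(sample):
    """Return the text paired with an image for contrastive pre-training.

    If the sample carries a real caption (image-text pair), use it as-is;
    otherwise convert its class label into a prompt sentence, so labeled
    and captioned data feed the same text encoder in the same format.
    """
    if sample.get("caption"):
        return sample["caption"]
    return random.choice(TEMPLATES).format(sample["label"])

# Example: {"label": "golden retriever"} -> "a photo of a golden retriever."
```

The distribution-shift concern raised in the quote arises exactly here: template sentences like these are far more uniform than real captions, which the language encoder may not handle well without further care.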