2021
DOI: 10.48550/arxiv.2111.12233
Preprint

Scaling Up Vision-Language Pre-training for Image Captioning

Abstract: In recent years, we have witnessed a significant performance boost in the image captioning task based on vision-language pre-training (VLP). Scale is believed to be an important factor for this advance. However, most existing work only focuses on pre-training transformers with moderate sizes (e.g., 12 or 24 layers) on roughly 4 million images. In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning. We use the state…
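As a rough, hedged illustration of what "scale" means in this context, the sketch below estimates how a transformer's parameter count grows with depth and hidden width. The layer/width settings and the vocabulary size are hypothetical placeholders, not the configurations reported in the paper.

```python
# Back-of-the-envelope parameter count for a transformer stack.
# Per layer: self-attention (~4 * hidden^2) + feed-forward with 4x expansion
# (~8 * hidden^2), ignoring biases and layer norms; embeddings add vocab * hidden.
# All settings below are hypothetical, not the paper's actual configurations.

def transformer_params(num_layers: int, hidden: int, vocab: int = 30522) -> int:
    per_layer = 12 * hidden * hidden
    return num_layers * per_layer + vocab * hidden

for name, layers, hidden in [("12-layer", 12, 768),
                             ("24-layer", 24, 1024),
                             ("scaled-up", 32, 1280)]:
    print(f"{name:>9s}: ~{transformer_params(layers, hidden) / 1e6:.0f}M parameters")
```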

Cited by 8 publications (20 citation statements) | References 49 publications
“…Table 1 shows the results on zero-shot image captioning. For a comprehensive comparison, we also include the results of several representative (1) supervised methods: BUTD [3], GVD [93], UniVLP [94], ClipCap [51], Oscar [40], and LEMON [26]; and (2) weakly supervised methods: UIC [19], IC-SME [37], S2S-SS and S2S-GCC [25].…”
Section: Results (mentioning)
confidence: 99%
“…Beyond describing the whole image scene, dense captioning methods [30,5,35,88,89] aim to describe the visual objects in a sub-region of the input image. Recently, vision-language pre-training methods [94,40,51,26], which benefit from the rich visual-textual representations learned on large-scale datasets, have become the prevailing approach to vision-language generation, re-training or fine-tuning the model parameters on downstream tasks. Although these methods have achieved impressive results, a certain amount of paired image-text data is indispensable during training.…”
Section: Image Captioning (mentioning)
confidence: 99%
“…Although RNN-based language models have been the standard strategy for generating the caption, convolutional language models [40] and fully-attentive language models [14], [41], [42], [43], [44] based on the Transformer paradigm [45] have been explored for image captioning, also motivated by the success of these approaches on Natural Language Processing tasks such as machine translation and language understanding [45], [46], [47]. Moreover, the introduction of Transformer-based language models has led to the development of effective variants or modifications of the self-attention operator [7], [11], [12], [13], [48], [49], [8] and has enabled vision-and-language early fusion [19], [22], [50] based on BERT-like architectures [46].…”
Section: Related Work (mentioning)
confidence: 99%
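Since several of the statements above refer to fully-attentive, Transformer-based language models that attend over image features, here is a minimal sketch of such a captioning head in PyTorch. The sizes, region count, and layer settings are arbitrary placeholders, and this is not any specific published architecture.

```python
# Minimal sketch: a Transformer decoder attends over pre-extracted image
# features ("memory") and predicts the next caption token autoregressively.
import torch
import torch.nn as nn

vocab_size, d_model, num_regions = 10000, 512, 36

token_embed = nn.Embedding(vocab_size, d_model)
decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
lm_head = nn.Linear(d_model, vocab_size)

image_features = torch.randn(2, num_regions, d_model)    # e.g. region/grid features
caption_tokens = torch.randint(0, vocab_size, (2, 12))   # caption prefix so far

# Causal mask: each position may only attend to earlier caption tokens.
causal_mask = nn.Transformer.generate_square_subsequent_mask(caption_tokens.size(1))

hidden = decoder(token_embed(caption_tokens), image_features, tgt_mask=causal_mask)
next_token_logits = lm_head(hidden[:, -1])                # distribution over the next word
```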
“…These are in the form of soft labels, which the captioning model has to align with in the cross-entropy phase, and a reweighting of the caption words to guide the fine-tuning phase. A further improvement in the performance of recent self-attention-based image captioning approaches comes from the use of large-scale vision-and-language pre-training [17], [19], [20], [22], [50], which can be done on noisy image-text pairs and can also exploit pre-training losses other than cross-entropy, such as the masked token loss [19], [20]. Different from previous methods, our approach is based on the interplay of two different language models that are trained with the mean teacher learning paradigm and knowledge distillation, without relying on large-scale pre-training.…”
Section: Related Work (mentioning)
confidence: 99%
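The masked token loss mentioned above is, in essence, BERT-style masked language modeling applied to the text side of an image-text pair: a fraction of caption tokens is masked and the model must recover them conditioned on both modalities. The sketch below is a hedged illustration with placeholder tensors and a stand-in for the multimodal encoder, not the cited papers' exact implementation.

```python
# Masked-token-loss sketch: corrupt some caption tokens, compute cross-entropy
# only at the masked positions. The logits tensor stands in for the output of
# a multimodal encoder, e.g. model(image_features, masked_tokens).
import torch
import torch.nn.functional as F

vocab_size, mask_id, mask_prob = 10000, 103, 0.15
tokens = torch.randint(0, vocab_size, (2, 12))               # caption token ids

mask = torch.rand(tokens.shape) < mask_prob                  # positions to corrupt
masked_tokens = tokens.masked_fill(mask, mask_id)            # model input
labels = tokens.masked_fill(~mask, -100)                     # -100 = ignored by the loss

logits = torch.randn(2, 12, vocab_size, requires_grad=True)  # stand-in for model output
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
loss.backward()
```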