2021
DOI: 10.1007/978-3-030-88480-2_63
XGPT: Cross-modal Generative Pre-Training for Image Captioning

Cited by 33 publications (31 citation statements)
References 21 publications
“…• When the input data 𝑥 is multimedia input such as image and speech, text generation becomes image caption [192] or speech recognition [43]. In image caption, we might expect the generated caption text to be vivid for attracting children, while in speech recognition, the transformed text must be faithful to the original speech.…”
Section: Text Generation
confidence: 99%
“…We empirically investigate how to glue the visual pre-trained model (CLIP-ViT) and the language pre-trained model (GPT2) together for end-to-end generative vision-and-language pre-training (Cho et al., 2021; Xia et al., 2021; Fang et al., 2021). However, most of the cross-modal pre-trained models require millions of parallel image-caption data for generative and/or denoising pre-training.…”
Section: Cross-modal Fusion
confidence: 99%
“…Language Decoder. Most of the previous G-VLP works choose to pre-train their language decoder from scratch (Xia et al., 2021; Cho et al., 2021; Fang et al., 2021). Such a setting requires a model to spend extra effort on how to generate smooth sentences, which increases the burden on the model.…”
Section: Model Structure
confidence: 99%