2021
DOI: 10.1007/978-3-030-88480-2_63
XGPT: Cross-modal Generative Pre-Training for Image Captioning

Cited by 33 publications (31 citation statements)
References 21 publications
“…• When the input data 𝑥 is multimedia input such as image and speech, text generation becomes image caption [192] or speech recognition [43]. In image caption, we might expect the generated caption text to be vivid for attracting children, while in speech recognition, the transformed text must be faithful to the original speech.…”
Section: Text Generation
confidence: 99%
“…We empirically investigate how to glue the visual pre-trained model (CLIP-ViT) and the language pre-trained model (GPT2) together for end-to-end generative vision-and-language pre-training (Cho et al., 2021; Xia et al., 2021; Fang et al., 2021). However, most of the cross-modal pre-trained models require millions of parallel image-caption data for generative and/or denoising pre-training.…”
Section: Cross-modal Fusion
confidence: 99%
“…Language Decoder. Most of the previous G-VLP works choose to pre-train their language decoder from scratch (Xia et al., 2021; Cho et al., 2021; Fang et al., 2021). Such a setting requires a model to spend extra effort on how to generate smooth sentences, which increases the burden on the model.…”
Section: Model Structure
confidence: 99%