2020
DOI: 10.1609/aaai.v34i07.7005

Unified Vision-Language Pre-Training for Image Captioning and VQA

Abstract: This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text …
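
The abstract's central design point is a single transformer stack that serves as both encoder and decoder, with the fine-tuning regime (understanding vs. generation) controlled by the self-attention mask. The sketch below only illustrates that idea in PyTorch and is not the authors' released code: the class name, layer counts, dimensions, and the exact masking scheme are assumptions.

```python
# Minimal sketch (assumptions, not the authors' released code) of the unified
# idea: one shared transformer stack serves understanding tasks with full
# bidirectional attention and generation tasks with a seq2seq (causal) mask.
# Positional/segment embeddings and pre-training objectives are omitted.
import torch
import torch.nn as nn

class UnifiedVLTransformer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_layers=6,
                 vocab_size=30522, region_dim=2048):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)  # detector region features -> model dim
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared_stack = nn.TransformerEncoder(layer, n_layers)  # used for BOTH settings
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, token_ids, generation=False):
        x_img = self.region_proj(region_feats)   # (B, R, d)
        x_txt = self.token_emb(token_ids)        # (B, T, d)
        x = torch.cat([x_img, x_txt], dim=1)     # joint image-text sequence
        R, T = x_img.size(1), x_txt.size(1)
        mask = torch.zeros(R + T, R + T, dtype=torch.bool)  # False = may attend
        if generation:
            # seq2seq masking: image regions attend only to regions; each text
            # token attends to all regions plus the text tokens to its left.
            mask[:R, R:] = True
            mask[R:, R:] = torch.ones(T, T).triu(1).bool()
        h = self.shared_stack(x, mask=mask)
        return self.lm_head(h[:, R:])            # logits at the text positions

model = UnifiedVLTransformer()
regions = torch.randn(2, 36, 2048)              # e.g., 36 detected regions per image
tokens = torch.randint(0, 30522, (2, 12))       # a short text sequence
vqa_logits = model(regions, tokens, generation=False)  # understanding-style masking
cap_logits = model(regions, tokens, generation=True)   # generation-style masking
```

The same shared stack is used in both calls; only the attention mask changes, which is the sense in which the model is "unified".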

Cited by 587 publications (410 citation statements)
References 28 publications (48 reference statements)

“…While multi-head attention has been widely exploited in many vision-language (VL) tasks, such as image captioning (Zhou et al, 2020), visual question answering (Tan and Bansal, 2019), and visual dialog (Kang et al, 2019; Wang et al, 2020), its potential benefit for modeling flexible cross-media posts has been previously ignored. Due to the informal style of social media, cross-media keyphrase prediction brings unique difficulties in two main aspects: first, its text-image relationship is rather complicated (Vempala and Preotiuc-Pietro, 2019), whereas in conventional VL tasks the two modalities share most of their semantics; second, social media images usually exhibit a more diverse distribution and a much higher probability of containing OCR tokens (§4), thereby posing a hurdle for effective processing.…”
Section: Related Work
confidence: 99%
“…• L. Zhou et al [19], Vision-Language Pre-training (VLP): State-of-the-art attention-based model for vision-language generation tasks such as image captioning and visual question answering. • S. Zhao et al [8], Feature-based IER: Image Emotion Recognition approach using low- and mid-level visual features.…”
Section: State-of-the-art (SOTA) Methods for Performance Comparison
confidence: 99%
“…We briefly describe our base captioning model, which consists of a Faster R-CNN and a Transformer-based encoder-decoder, following the sequence-to-sequence framework common in state-of-the-art image captioning systems (Anderson et al, 2018; Vinyals et al, 2015; Zhou et al, 2019). See Appendix A for full technical details.…”
Section: Model Architecture
confidence: 99%
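
The excerpt above describes the now-standard pipeline of detector region features feeding a Transformer encoder-decoder trained sequence-to-sequence. The snippet below is a minimal, hedged sketch of that setup, not the cited system's implementation: it assumes region features are precomputed by a Faster R-CNN detector, and the class name, layer counts, and dimensions are illustrative.

```python
# Minimal sketch (illustrative assumptions, not the cited authors' code) of a
# region-features -> Transformer encoder-decoder captioner. Region features are
# assumed to be precomputed by a Faster R-CNN detector; positional embeddings
# and the decoding loop are omitted for brevity.
import torch
import torch.nn as nn

class RegionCaptioner(nn.Module):
    def __init__(self, d_model=512, n_heads=8, vocab_size=10000, feat_dim=2048):
        super().__init__()
        self.region_proj = nn.Linear(feat_dim, d_model)  # detector features -> model dim
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=3, num_decoder_layers=3, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, caption_ids):
        src = self.region_proj(region_feats)   # (B, R, d): encoder reads image regions
        tgt = self.token_emb(caption_ids)      # (B, T, d): decoder reads the caption so far
        causal = self.transformer.generate_square_subsequent_mask(caption_ids.size(1))
        h = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(h)                     # next-token logits, (B, T, vocab)

captioner = RegionCaptioner()
feats = torch.randn(2, 36, 2048)               # 36 region features per image
caps = torch.randint(0, 10000, (2, 15))        # teacher-forced caption prefixes
logits = captioner(feats, caps)                # (2, 15, 10000)
```

The sketch shows only the teacher-forced training forward pass; at inference time the caption would be decoded autoregressively (e.g., greedy or beam search) by feeding back the predicted tokens.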