2023
DOI: 10.1007/s11633-022-1369-5

VLP: A Survey on Vision-language Pre-training

Abstract: In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial work has shown that these models benefit downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-train…
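
To make the survey's subject concrete, the sketch below illustrates one representative vision-language pre-training objective: CLIP-style image-text contrastive learning, in which matched image/caption pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart. This is a minimal sketch of a common technique covered by VLP surveys, not this paper's own method; the function name, embedding dimension, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity logits between every image and every caption.
    logits = image_emb @ text_emb.t() / temperature
    # Matched image/caption pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy batch: 4 image/caption pairs with 512-dim encoder outputs
# (dimensions are placeholders; real encoders would produce these).
img = torch.randn(4, 512)
txt = torch.randn(4, 512)
print(clip_style_contrastive_loss(img, txt))
```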

Cited by 86 publications (20 citation statements). References 111 publications.
“…We will briefly review existing ideas and methods that are highly related to our work. For more details, please refer to previous survey papers [1], [2], [27], [28], [29].…”
Section: Related Work
Mentioning confidence: 99%
“…For instance, Li et al [61] shared advances on vision-language tasks, including VLM pretraining for various task-specific methods. Du et al [62] and Chen et al [63] reviewed VLM pre-training for vision-language tasks [57], [58], [60]. Xu et al [64] and Wang et al [65] shared recent progress of multi-modal learning on multi-modal tasks (e.g., language, vision and auditory modalities).…”
Section: Relevant Surveys
Mentioning confidence: 99%
“…Since 2014, the ascendancy of deep learning techniques has reverberated in cross-modal retrieval, harnessing the potency of deep neural networks to autonomously glean high-level feature representations from multi-modal data [5]. In recent years, a cascade of cross-modal retrieval approaches has been tailored to diverse open scenarios, harnessing the potential of vision-language pretraining models [6]. These strides have notably bolstered the precision, robustness, and scalability of cross-modal retrieval systems by infusing sophisticated learning models and training strategies.…”
Section: Introduction
Mentioning confidence: 99%