2023
DOI: 10.48550/arxiv.2302.10035
Preprint

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Abstract: With the urgent demand for generalized deep models, many large pre-trained models have been proposed, such as BERT, ViT, and GPT. Inspired by the success of these models in single domains (like computer vision and natural language processing), multi-modal pre-trained big models have also drawn more and more attention in recent years. In this work, we give a comprehensive survey of these models and hope this paper provides new insights and helps fresh researchers track the most cutting-edge works. Specifica…

Cited by 5 publications (3 citation statements)
References 172 publications (247 reference statements)
“…The kinds of tasks to which GLLMMs can be applied include pre-training in content creation (such as text, audio tracks, images, or videos) that become part of a larger undertaking such as designing art, composing music, or telling a story. Another example would be language translation where GLLMMs can convert text from one language to another with greater accuracy because they have also been primed by gaining earlier access to cultural nuances and context [12]. An example where GLLMMs are applied in the field of medicine would be to analyze the results of medical imaging more accurately (such as X-ray, MRI scans, or ultrasound) by pretraining with access to additional databases that relate to the relevant histology and pathology implicated in the imaging results [13].…”
Section: A Brief Primer On the Concepts And Nomenclature Of AI (mentioning)
confidence: 99%
“…• Transferable debiasing techniques involve developing debiasing techniques that are designed to be transferable across different models, datasets, or domains. These techniques may incorporate generalization principles, domain-independent features, or model-agnostic approaches that enable their application to diverse settings [176], [177].…”
Section: Lack Of Transferability (mentioning)
confidence: 99%
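
The statement above mentions model-agnostic debiasing approaches without spelling one out. As a hedged illustration only (not drawn from the surveyed paper or from the cited works [176], [177]), here is a minimal Python sketch of one classic model-agnostic preprocessing technique, reweighing: it computes per-sample weights that decorrelate the label from a protected attribute, and those weights plug into any estimator that accepts sample_weight. The function name and the toy data are assumptions for illustration.

# Hypothetical sketch of model-agnostic debiasing via reweighing.
# The weights equalize the joint distribution of label and protected
# attribute, so the same preprocessing transfers across classifiers.
import numpy as np
from sklearn.linear_model import LogisticRegression

def reweighing_weights(y, group):
    """Per-sample weights making label y independent of protected attribute group."""
    y, group = np.asarray(y), np.asarray(group)
    n = len(y)
    w = np.empty(n, dtype=float)
    for g in np.unique(group):
        for label in np.unique(y):
            mask = (group == g) & (y == label)
            # count expected under independence, divided by observed count
            expected = (group == g).sum() * (y == label).sum() / n
            w[mask] = expected / max(mask.sum(), 1)
    return w

# Toy usage: any estimator accepting sample_weight can consume the weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
group = rng.integers(0, 2, size=200)          # protected attribute
y = (X[:, 0] + 0.8 * group > 0).astype(int)   # label correlated with group
clf = LogisticRegression().fit(X, y, sample_weight=reweighing_weights(y, group))

Because the correction lives entirely in the data weights rather than inside any particular model, the same function applies unchanged across classifiers and datasets, which is the transferability property the statement describes.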
“…Du et al [62] and Chen et al [63] reviewed VLM pre-training for visionlanguage tasks [57], [58], [60]. Xu et al [64] and Wang et al [65] shared recent progress of multi-modal learning on multi-modal tasks (e.g., language, vision and auditory modalities). Differently, we review VLMs for visual recognition tasks from three major aspects: 1) Recent progress of VLM pre-training for visual recognition tasks; 2) Two typical transfer approaches from VLMs to visual recognition tasks, i.e., transfer learning approach and knowledge distillation approach; 3) Benchmarking of state-of-the-art VLM pretraining methods on visual recognition tasks.…”
Section: Relevant Surveys (mentioning)
confidence: 99%