2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01278
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Cited by 180 publications (128 citation statements)
References 23 publications
“…Vision-Language Pretraining. Recent years have witnessed great success in Vision-Language pretraining models [4,11,12,14,18,22,26,31,34,41,42] based on Transformer architectures [32]. Image structures have been proven useful to pretraining models, such as scene graphs [38].…”
Section: Related Work (mentioning)
confidence: 99%
“…Vision-Language Pretraining: Joint vision-language pretraining has been an effective approach to improve performance on VL tasks. Many works, such as LXMERT [45], VL-BERT [43], UNITER [10], VILLA [15], and others [19,20,25,28,52,57] have leveraged this approach. All of these works utilized datasets on the order of 10M samples for various pretraining objectives.…”
Section: Related Work (mentioning)
confidence: 99%
“…Considering this, most recent works that have achieved state-of-the-art results on these tasks tend to capture logic and prior knowledge with various cross-modal transformer architectures that jointly model both modalities in a unified architecture [10,15,19,20,43,45]. These models are typically pretrained on large vision-language datasets composed of paired images and text, and optimized via a combination of supervised and self-supervised objective functions.…”
Section: Introduction (mentioning)
confidence: 99%
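As a minimal sketch of the pretraining recipe described in the statement above: a cross-modal transformer is typically optimized with a weighted sum of a self-supervised objective (masked language modeling) and a supervised one (image-text matching). The loss names, weights, and tensor shapes below are illustrative assumptions, not taken from any specific cited paper.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, itm_logits, itm_labels, itm_weight=1.0):
    # Masked language modeling: predict masked word tokens (ignore_index marks unmasked positions).
    mlm = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100)
    # Image-text matching: binary decision, does this caption describe this image?
    itm = F.cross_entropy(itm_logits, itm_labels)
    return mlm + itm_weight * itm

# Toy shapes: batch of 4, sequence of 16 tokens, vocabulary of 1000.
mlm_logits = torch.randn(4, 16, 1000)
mlm_labels = torch.full((4, 16), -100)
mlm_labels[:, 3] = 42                      # pretend one position per sequence was masked
itm_logits = torch.randn(4, 2)
itm_labels = torch.randint(0, 2, (4,))
print(pretraining_loss(mlm_logits, mlm_labels, itm_logits, itm_labels))
```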
“…However, such a strategy suffers from limitations: i) extracting region features with an object detector is computationally inefficient, and ii) the quality of the visual features is largely limited by the predefined visual vocabulary of the pre-trained object detector. To address this issue, rather than relying on region-based visual features, SOHO [18] takes a whole image as input and extracts compact image features through a visual dictionary, which enables roughly 10 times faster inference than region-based methods. ViLT [20] discards convolutional visual features entirely and adopts a vision transformer [11] to model long-range dependencies over a sequence of fixed-size non-overlapping image patches.…”
Section: Related Work (mentioning)
confidence: 99%
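The statement above describes SOHO's visual dictionary as quantizing dense grid features into compact visual tokens. The sketch below (not the authors' code) illustrates that idea under stated assumptions: a learnable codebook, nearest-neighbor lookup over squared L2 distance, and a straight-through estimator so the backbone still receives gradients; the codebook size and feature dimension are illustrative.

```python
import torch
import torch.nn as nn

class VisualDictionary(nn.Module):
    def __init__(self, num_entries: int = 2048, dim: int = 768):
        super().__init__()
        # Codebook of visual "words"; the initialization scheme is an assumption.
        self.codebook = nn.Embedding(num_entries, dim)
        nn.init.normal_(self.codebook.weight, std=0.02)

    def forward(self, grid_feats: torch.Tensor) -> torch.Tensor:
        # grid_feats: (batch, num_regions, dim) dense features from an image backbone.
        b, n, d = grid_feats.shape
        flat = grid_feats.reshape(-1, d)                      # (b*n, d)
        # Squared L2 distance from each feature to every codebook entry.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1)                              # index of nearest visual word
        quantized = self.codebook(idx).reshape(b, n, d)
        # Straight-through estimator: forward the quantized token, backprop to the raw feature.
        return grid_feats + (quantized - grid_feats).detach()

# Usage: quantize a batch of 7x7 grid features before feeding them to a transformer.
feats = torch.randn(2, 49, 768)
tokens = VisualDictionary()(feats)
print(tokens.shape)  # torch.Size([2, 49, 768])
```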