2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01519

FLAVA: A Foundational Language And Vision Alignment Model

Cited by 212 publications (143 citation statements)
References 39 publications
“…Vision-Language Recognition: The recent paradigm of vision-language pretraining [17,33,41,50,51], in which models are trained on large corpora of image-text pairs, has enabled vision to grow past the fixed-category paradigm. Models such as CLIP [33] and ALIGN [17] learn a joint representation over images and text via a contrastive loss that pulls corresponding image-text pairs together in representation space, while pushing non-corresponding pairs apart.…”
Section: Related Work
confidence: 99%
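
As a rough sketch of the contrastive objective described in the statement above, the following Python/PyTorch snippet computes a CLIP-style symmetric InfoNCE loss over a batch of image and text embeddings. The function name, tensor shapes, and temperature value are illustrative assumptions, not details taken from the cited papers.

    # Sketch only: CLIP-style symmetric contrastive (InfoNCE) loss.
    # image_emb and text_emb are assumed to be [batch, dim] encoder outputs.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(image_emb: torch.Tensor,
                         text_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
        # L2-normalize so dot products become cosine similarities.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # Similarity matrix: entry (i, j) compares image i with text j.
        logits = image_emb @ text_emb.t() / temperature

        # Corresponding image-text pairs lie on the diagonal; use them as targets.
        targets = torch.arange(image_emb.size(0), device=image_emb.device)

        # Pull matching pairs together and push non-matching pairs apart,
        # symmetrically in both retrieval directions.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)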
“…Localized Narratives (Pont-Tuset et al., 2020) is another new image-text dataset, where the annotators are asked to describe an image with their voice while simultaneously hovering their mouse over the region they are describing; as a result, each image corresponds to a long paragraph. This dataset has recently been used for image-text pre-training in FLAVA (Singh et al., 2022). Besides image-caption datasets, existing works, such as UNIMO (Li et al., 2021e), UNIMO-2, and VL-BEiT (Bao et al., 2022b), also propose to use image-only and text-only datasets for multimodal pre-training.…”
Section: Pre-training Datasets
confidence: 99%
“…the self-attention jointly attends over the tokens of both modalities. Dual-stream models use separate Transformers for each modality that are connected through a co-attention mechanism (Tan and Bansal, 2019; Lu et al., 2019), concatenated in a single-stream model on top (Singh et al., 2022; Kamath et al., 2021), or the image model output is used asymmetrically for cross-attention in the text model.…”
Section: Related Work
confidence: 99%
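
To make the single-stream vs. dual-stream distinction in the statement above concrete, the sketch below contrasts one fusion layer of each kind in Python/PyTorch. The class names, dimensions, and use of nn.MultiheadAttention are illustrative assumptions and do not reproduce the cited models' actual implementations.

    # Sketch only: one fusion layer of a single-stream and a dual-stream
    # multimodal Transformer (hypothetical classes, not the cited models).
    import torch
    import torch.nn as nn

    class SingleStreamLayer(nn.Module):
        # Image and text tokens are concatenated so one self-attention
        # block jointly attends over the tokens of both modalities.
        def __init__(self, dim: int = 256, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, img_tokens, txt_tokens):
            joint = torch.cat([img_tokens, txt_tokens], dim=1)
            out, _ = self.attn(joint, joint, joint)
            return out

    class DualStreamLayer(nn.Module):
        # Each modality keeps its own stream; co-attention lets text
        # queries attend to image keys/values and vice versa.
        def __init__(self, dim: int = 256, heads: int = 4):
            super().__init__()
            self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, img_tokens, txt_tokens):
            txt_out, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
            img_out, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
            return img_out, txt_out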