2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01519

FLAVA: A Foundational Language And Vision Alignment Model

Cited by 212 publications (143 citation statements)
References 39 publications
“…Vision-Language Recognition: The recent paradigm of vision-language pretraining [17,33,41,50,51], in which models are trained on large corpora of image-text pairs, has enabled vision to grow past the fixed-category paradigm. Models such as CLIP [33] and ALIGN [17] learn a joint representation over images and text via a contrastive loss that pulls corresponding image-text pairs together in representation space, while pushing non-corresponding pairs apart.…”
Section: Related Work
confidence: 99%
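
As a rough sketch of the contrastive objective described in the statement above, the following Python/PyTorch snippet computes a CLIP-style symmetric InfoNCE loss over a batch of image and text embeddings. The function name, tensor shapes, and temperature value are illustrative assumptions, not details taken from the cited papers.

    # Sketch only: CLIP-style symmetric contrastive (InfoNCE) loss.
    # image_emb and text_emb are assumed to be [batch, dim] encoder outputs.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(image_emb: torch.Tensor,
                         text_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
        # L2-normalize so dot products become cosine similarities.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # Similarity matrix: entry (i, j) compares image i with text j.
        logits = image_emb @ text_emb.t() / temperature

        # Corresponding image-text pairs lie on the diagonal; use them as targets.
        targets = torch.arange(image_emb.size(0), device=image_emb.device)

        # Pull matching pairs together and push non-matching pairs apart,
        # symmetrically in both retrieval directions.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)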
“…Localized Narratives (Pont-Tuset et al., 2020) is another new image-text dataset, where the annotators are asked to describe an image with their voice while simultaneously hovering their mouse over the region they are describing; as a result, each image corresponds to a long paragraph. This dataset has recently been used for image-text pre-training in FLAVA (Singh et al., 2022). Besides image-caption datasets, existing works, such as UNIMO (Li et al., 2021e), UNIMO-2, and VL-BEiT (Bao et al., 2022b), also propose to use image-only and text-only datasets for multimodal pre-training.…”
Section: Pre-training Datasets
confidence: 99%
“…the self-attention jointly attends over the tokens of both modalities. Dual-stream models use separate Transformers for each modality that are connected through a co-attention mechanism (Tan and Bansal, 2019; Lu et al., 2019), concatenated in a single-stream model on top (Singh et al., 2022; Kamath et al., 2021), or the image model output is used asymmetrically for cross-attention in the text model.…”
Section: Related Work
confidence: 99%
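
To make the single-stream vs. dual-stream distinction in the statement above concrete, the sketch below contrasts one fusion layer of each kind in Python/PyTorch. The class names, dimensions, and use of nn.MultiheadAttention are illustrative assumptions and do not reproduce the cited models' actual implementations.

    # Sketch only: one fusion layer of a single-stream and a dual-stream
    # multimodal Transformer (hypothetical classes, not the cited models).
    import torch
    import torch.nn as nn

    class SingleStreamLayer(nn.Module):
        # Image and text tokens are concatenated so one self-attention
        # block jointly attends over the tokens of both modalities.
        def __init__(self, dim: int = 256, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, img_tokens, txt_tokens):
            joint = torch.cat([img_tokens, txt_tokens], dim=1)
            out, _ = self.attn(joint, joint, joint)
            return out

    class DualStreamLayer(nn.Module):
        # Each modality keeps its own stream; co-attention lets text
        # queries attend to image keys/values and vice versa.
        def __init__(self, dim: int = 256, heads: int = 4):
            super().__init__()
            self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, img_tokens, txt_tokens):
            txt_out, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
            img_out, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
            return img_out, txt_out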