2019
DOI: 10.48550/arxiv.1908.03557
Preprint

VisualBERT: A Simple and Performant Baseline for Vision and Language

Cited by 458 publications (533 citation statements)
References 24 publications
“…Given the generative nature of CM3 in both the language and visual modalities, we used GWEAT/GSEAT to probe our model. Overall, we evaluated six bias tests for gender and seven bias tests for race and found that our family of CM3 models shows significantly less bias than other models, specifically VisualBERT (Li et al, 2019) and ViLBERT (Lu et al, 2019). We present our empirical results for gender and race bias in Table 8 and Table 9, respectively.…”
Section: Ethical Considerations
confidence: 94%
“…Sparked by natural language pre-training, a new wave of vision-language pre-training methods has been proposed recently to learn pre-trainable multi-modal encoders for vision-language perception tasks. VisualBERT [19] directly extends BERT by pre-training a Transformer-based encoder with two visually-grounded language model objectives: masked language modeling with the image and image-sentence matching. UNITER [5], Unicoder-VL [18], and VL-BERT [38] further introduce masked region modeling proxy tasks to enhance the vision-language alignment during pre-training.…”
Section: Related Work
confidence: 99%
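
The statement above summarizes VisualBERT's two pre-training objectives: masked language modeling conditioned on the image, and image-sentence matching. The snippet below is a minimal, hedged sketch (not the authors' original training code) of how those two heads are exposed through the HuggingFace transformers implementation; it assumes the published checkpoint "uclanlp/visualbert-vqa-coco-pre" is available, and the random visual_embeds tensor merely stands in for the Faster R-CNN region features a real pipeline would extract from an image.

# Minimal sketch: text tokens + placeholder region features fed to a
# VisualBERT pre-training model via HuggingFace transformers.
import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Text side: an image-grounded sentence with a masked token, tokenized as for BERT.
inputs = tokenizer("A dog [MASK] a ball in the park.", return_tensors="pt")

# Visual side: placeholder region features of shape (batch, num_regions, feature_dim).
num_regions = 36
feat_dim = model.config.visual_embedding_dim
visual_embeds = torch.randn(1, num_regions, feat_dim)
visual_attention_mask = torch.ones(1, num_regions, dtype=torch.long)
visual_token_type_ids = torch.ones(1, num_regions, dtype=torch.long)

with torch.no_grad():
    outputs = model(
        **inputs,
        visual_embeds=visual_embeds,
        visual_attention_mask=visual_attention_mask,
        visual_token_type_ids=visual_token_type_ids,
    )

# prediction_logits scores the vocabulary for the masked token (masked LM with the image);
# seq_relationship_logits scores whether the sentence matches the image.
print(outputs.prediction_logits.shape, outputs.seq_relationship_logits.shape)

Running the forward pass yields one output per objective: the masked-LM head for visually grounded masked language modeling and the sequence-relationship head for image-sentence matching.
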
“…★ denotes our implementation by using the same pre-training data/backbone as in Uni-EDEN. pre-trainable encoder module (i.e., VisualBERT [19], ViLBERT [23], VL-BERT [38], LXMERT [40], and UNITER [5]) for only vision-language perception tasks, and pre-trainable encoder-decoder structure (Unified VLP [51]) for both vision-language perception and generation tasks. For fair comparison with our Uni-EDEN, we re-implement LXMERT and UNITER by pre-training them over Conceptual Captions.…”
Section: Performance Comparison
confidence: 99%
“…Various transformer-based VQA models [Su et al., 2019, Li et al., 2019b,a, Zhou et al., 2019, Chefer et al., 2021] have been introduced in the last few years. Among them, [Tan and Bansal, 2019] and are two-stream transformer architectures that use cross-attention layers and co-attention layers, respectively, to allow information exchange across modalities.…”
Section: Related Work
confidence: 99%
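
The quoted passage contrasts single-stream models such as VisualBERT with two-stream architectures that exchange information across modalities through cross-attention or co-attention layers. The sketch below is an illustrative PyTorch toy, not LXMERT's or ViLBERT's actual implementation; the class name, dimensions, and head count are invented for the example. It shows only the core idea: each stream's tokens act as queries against the other stream's tokens as keys and values.

# Toy cross-modal attention step for a two-stream vision-language transformer.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, image: torch.Tensor):
        # Each stream uses its own tokens as queries and the other stream's
        # tokens as keys/values, so information flows across modalities.
        text_out, _ = self.txt_attends_img(query=text, key=image, value=image)
        image_out, _ = self.img_attends_txt(query=image, key=text, value=text)
        return text_out, image_out

# Toy usage: batch of 2, 16 text tokens, 36 image regions, hidden size 768.
layer = CrossModalAttention()
text_tokens = torch.randn(2, 16, 768)
image_regions = torch.randn(2, 36, 768)
text_ctx, image_ctx = layer(text_tokens, image_regions)
print(text_ctx.shape, image_ctx.shape)  # (2, 16, 768) and (2, 36, 768)

In the published two-stream models this exchange is wrapped in residual connections, layer normalization, and per-modality feed-forward blocks and repeated over several layers; the toy above isolates only the cross-modal attention step.
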