2019
DOI: 10.48550/arxiv.1908.03557
Preprint

VisualBERT: A Simple and Performant Baseline for Vision and Language

Cited by 458 publications (533 citation statements)
References 24 publications
“…Given the generative nature of CM3 in both the language and visual modalities, we used GWEAT/GSEAT to probe our model. Overall, we evaluated six bias tests for gender and seven bias tests for race and found that our family of CM3 models shows significantly less bias than other models, specifically VisualBERT (Li et al, 2019) and ViLBERT (Lu et al, 2019). We present our empirical results for gender and race bias in Table 8 and Table 9, respectively.…”
Section: Ethical Considerations
confidence: 94%
“…Sparked by natural language pre-training, a new wave of vision-language pre-training methods has been proposed recently to learn pre-trainable multi-modal encoders for vision-language perception tasks. VisualBERT [19] directly extends BERT by pre-training a Transformer-based encoder with two visually-grounded language model objectives: masked language modeling with the image and image-sentence matching. UNITER [5], Unicoder-VL [18], and VL-BERT [38] further introduce masked region modeling proxy tasks to enhance the vision-language alignment during pre-training.…”
Section: Related Work
confidence: 99%
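
The statement above summarizes VisualBERT's two pre-training objectives: masked language modeling conditioned on the image, and image-sentence matching. The snippet below is a minimal, hedged sketch (not the authors' original training code) of how those two heads are exposed through the HuggingFace transformers implementation; it assumes the published checkpoint "uclanlp/visualbert-vqa-coco-pre" is available, and the random visual_embeds tensor merely stands in for the Faster R-CNN region features a real pipeline would extract from an image.

# Minimal sketch: text tokens + placeholder region features fed to a
# VisualBERT pre-training model via HuggingFace transformers.
import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Text side: an image-grounded sentence with a masked token, tokenized as for BERT.
inputs = tokenizer("A dog [MASK] a ball in the park.", return_tensors="pt")

# Visual side: placeholder region features of shape (batch, num_regions, feature_dim).
num_regions = 36
feat_dim = model.config.visual_embedding_dim
visual_embeds = torch.randn(1, num_regions, feat_dim)
visual_attention_mask = torch.ones(1, num_regions, dtype=torch.long)
visual_token_type_ids = torch.ones(1, num_regions, dtype=torch.long)

with torch.no_grad():
    outputs = model(
        **inputs,
        visual_embeds=visual_embeds,
        visual_attention_mask=visual_attention_mask,
        visual_token_type_ids=visual_token_type_ids,
    )

# prediction_logits scores the vocabulary for the masked token (masked LM with the image);
# seq_relationship_logits scores whether the sentence matches the image.
print(outputs.prediction_logits.shape, outputs.seq_relationship_logits.shape)

Running the forward pass yields one output per objective: the masked-LM head for visually grounded masked language modeling and the sequence-relationship head for image-sentence matching.
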
“…★ denotes our implementation by using the same pre-training data/backbone as in Uni-EDEN. pre-trainable encoder module (i.e., VisualBERT [19], ViLBERT [23], VL-BERT [38], LXMERT [40], and UNITER [5]) for only vision-language perception tasks, and pre-trainable encoder-decoder structure (Unified VLP [51]) for both vision-language perception and generation tasks. For fair comparison with our Uni-EDEN, we re-implement LXMERT and UNITER by pre-training them over Conceptual Captions.…”
Section: Performance Comparison
confidence: 99%
“…Various transformer-based VQA models [Su et al., 2019, Li et al., 2019b,a, Zhou et al., 2019, Chefer et al., 2021] have been introduced in the last few years. Among them, [Tan and Bansal, 2019] and are two-stream transformer architectures that use cross-attention layers and co-attention layers, respectively, to allow information exchange across modalities.…”
Section: Related Work
confidence: 99%
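
The quoted passage contrasts single-stream models such as VisualBERT with two-stream architectures that exchange information across modalities through cross-attention or co-attention layers. The sketch below is an illustrative PyTorch toy, not LXMERT's or ViLBERT's actual implementation; the class name, dimensions, and head count are invented for the example. It shows only the core idea: each stream's tokens act as queries against the other stream's tokens as keys and values.

# Toy cross-modal attention step for a two-stream vision-language transformer.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, image: torch.Tensor):
        # Each stream uses its own tokens as queries and the other stream's
        # tokens as keys/values, so information flows across modalities.
        text_out, _ = self.txt_attends_img(query=text, key=image, value=image)
        image_out, _ = self.img_attends_txt(query=image, key=text, value=text)
        return text_out, image_out

# Toy usage: batch of 2, 16 text tokens, 36 image regions, hidden size 768.
layer = CrossModalAttention()
text_tokens = torch.randn(2, 16, 768)
image_regions = torch.randn(2, 36, 768)
text_ctx, image_ctx = layer(text_tokens, image_regions)
print(text_ctx.shape, image_ctx.shape)  # (2, 16, 768) and (2, 36, 768)

In the published two-stream models this exchange is wrapped in residual connections, layer normalization, and per-modality feed-forward blocks and repeated over several layers; the toy above isolates only the cross-modal attention step.
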