2020
DOI: 10.1007/978-3-030-58577-8_8
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Cited by 891 publications (883 citation statements)
References 24 publications
“…in the last column) as a general indicator. Although the different architectures of models (i.e., 6L/512H and 12L/768H) affect the fine-tuning results, the voken-classification task consistently improves the downstream tasks' performance and achieves large average gains. We also show the transferability of our vokenizer to the RoBERTa model and observe the same phenomenon as that in BERT.…”
[Table rows comparing (Su et al., 2020), VisualBERT (Li et al., 2019), Oscar (Li et al., 2020a), and LXMERT (Tan and Bansal, 2019) were interleaved with this excerpt during extraction; only the citations are recoverable.]
Section: Results (mentioning)
confidence: 99%
“…We can see that the object labels can improve the V+L and V&L models. This is reasonable since object labels can be treated as the "anchor" between RoI features and textual features (OCR text and caption) [28].…”
Section: Results (mentioning)
confidence: 99%
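
The "anchor" role of object labels described in the statement above refers to Oscar's input layout: detected object tags are embedded in the same word space as the caption, so they sit between the textual tokens and the RoI features inside a single multimodal transformer. The following is a minimal sketch of that input construction, assuming PyTorch; the function name build_oscar_style_input and the embedding/projection modules are illustrative assumptions, not the released Oscar API.

    import torch

    # Minimal sketch (not the authors' released code) of an Oscar-style input:
    # the sequence is [caption tokens ; object tags ; region features], so the
    # object tags act as an anchor shared by the textual and visual modalities.
    def build_oscar_style_input(caption_ids, tag_ids, region_feats,
                                word_embed, region_proj):
        word_vecs = word_embed(caption_ids)      # (T, H) caption/OCR token embeddings
        tag_vecs = word_embed(tag_ids)           # (K, H) object tags share the word space
        region_vecs = region_proj(region_feats)  # (K, H) RoI features projected to H
        # One flat sequence for a single multimodal transformer encoder.
        return torch.cat([word_vecs, tag_vecs, region_vecs], dim=0)

    # Usage with illustrative sizes (BERT-base vocab/hidden, 2048-d detector features).
    word_embed = torch.nn.Embedding(30522, 768)
    region_proj = torch.nn.Linear(2048, 768)
    seq = build_oscar_style_input(
        torch.randint(0, 30522, (12,)),  # 12 caption tokens
        torch.randint(0, 30522, (5,)),   # 5 detected object tags
        torch.randn(5, 2048),            # RoI features for the same 5 regions
        word_embed, region_proj)
    # seq.shape == (22, 768): text, tags, and regions in one sequence.
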
“…A more closely related work to our problem is, however, TextVQA (Singh et al, 2019) which focuses on the problem of question/answering with scene texts, having infinite vocabulary. Correspondingly, a variety of solutions have also been proposed - the most successful have been based on attention (Xu et al, 2015;Yang et al, 2016;Anderson et al, 2018) and joint multimodal learning (Tan and Bansal, 2019;Lu et al, 2019;Chen et al, 2019;Li et al, 2020).…”
Section: Related Work (mentioning)
confidence: 99%
“…This is the novel and most important module of our framework which performs (a) chart structure understanding, (b) question understanding, and (c) reasoning over the chart to find the answer. We adapt the transformer-based frameworks from (Tan and Bansal, 2019;Lu et al, 2019;Chen et al, 2019;Li et al, 2020) to perform reasoning over charts. We demonstrate, empirically, that the architecture is strongly suited for the task of CQA through extensive experiments.…”
Section: Structure-based Transformers (mentioning)
confidence: 99%