Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Li, Xiujun; Yin, Xi; Li, Chunyuan; Zhang, Pengchuan; Hu, Xiaowei; Zhang, Lei; Wang, Limin; Hu, Houdong; Dong, Li; Wei, Furu; Choi, Yejin; Gao, Jianfeng

doi:10.1007/978-3-030-58577-8_8

Cited by 891 publications

(883 citation statements)

References 24 publications

Supporting

Mentioning

876

Contrasting

Order By: Relevance

“…in the last column) as a general indicator. Although the different architectures of models (i.e., 6L/512H and 12L/768H) affect the fine-tuning results, the voken-classification (Su et al, 2020) Yes 6.4e-3 90.1 89.5 88.6 82.9 VisualBERT (Li et al, 2019) Yes 6.5e-3 90.3 88.9 88.4 82.4 Oscar (Li et al, 2020a) Yes 41.6e-3 87.3 50.5 86.6 77.3 LXMERT (Tan and Bansal, 2019) No task consistently improves the downstream tasks' performance and achieves large average gains. We also show the transferability of our vokenizer to the RoBERTa model and observe the same phenomenon as that in BERT.…”

Section: Resultsmentioning

confidence: 99%

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Tan

Bansal

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

View full text Add to dashboard Cite

Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple purelanguage tasks such as GLUE, SQuAD, and SWAG. 1 listening learn Vokenization Humans learn language by listening, speaking ... humans Language Input BERT Transformer Model Masked Language Model [MASK] language by [MASK] speaking humans Language Input BERT Transformer Model Voken Classification Task [MASK] language by [MASK] speaking Masked Tokens Vokens (Token-Related Images)

show abstract

Section: Resultsmentioning

confidence: 99%

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Tan

Bansal

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

View full text Add to dashboard Cite

show abstract

“…We can see that the object labels can improve the V+L and V&L models. This is reasonable since object labels can be treated as the "anchor" between RoI features and textual features (OCR text and caption) [28].…”

Section: Resultsmentioning

confidence: 99%

Multimodal Learning For Hateful Memes Detection

Zhou¹,

Yang

2021

2021 IEEE International Conference on Multimedia &Amp; Expo Workshops (ICMEW)

View full text Add to dashboard Cite

Memes are used for spreading ideas through social networks. Although most memes are created for humor, some memes become hateful under the combination of pictures and text. Automatically detecting hateful memes can help reduce their harmful social impact. Compared to the conventional multimodal tasks, where the visual and textual information is semantically aligned, hateful memes detection is a more challenging task since the image and text in memes are weakly aligned or even irrelevant. Thus it requires the model to have a deep understanding of the content and perform reasoning over multiple modalities. This paper focuses on multimodal hateful memes detection and proposes a novel method incorporating the image captioning process into the memes detection process. We conduct extensive experiments on multimodal meme datasets and illustrate the effectiveness of our approach. Our model achieves promising results on the Hateful Memes Detection Challenge. Our code is made publicly available at GitHub.

show abstract

“…A more closely related work to our problem is, however, TextVQA (Singh et al, 2019) which focuses on the problem of question/answering with scene texts, having infinite vocabulary. Correspondingly, a variety of solutions have also been proposed -the most successful have been based on attention (Xu et al, 2015;Yang et al, 2016;Anderson et al, 2018) and joint multimodal learning (Tan and Bansal, 2019;Lu et al, 2019;Chen et al, 2019;Li et al, 2020).…”

Section: Related Workmentioning

confidence: 99%

“…This is the novel and most important module of our framework which performs (a) chart structure understanding, (b) question understanding, and (c) reasoning over the chart to find the answer. We adapt the transformer-based frameworks from (Tan and Bansal, 2019;Lu et al, 2019;Chen et al, 2019;Li et al, 2020) to perform reasoning over charts. We demonstrate, empirically, that the architecture is strongly suited for the task of CQA through extensive experiments.…”

Section: Structure-based Transformersmentioning

confidence: 99%

STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering

Singh

Shekhar

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

View full text Add to dashboard Cite

Chart Question Answering (CQA) is the task of answering natural language questions about visualisations in the chart image. Recent solutions, inspired by VQA approaches, rely on image-based attention for question/answering while ignoring the inherent chart structure. We propose STL-CQA which improves the question/answering through sequential elements localization, question encoding and then, a structural transformer-based learning approach. We conduct extensive experiments while proposing pre-training tasks, methodology and also an improved dataset with more complex and balanced questions of different types. The proposed methodology shows a significant accuracy improvement compared to the stateof-the-art approaches on various chart Q/A datasets, while outperforming even human baseline on the DVQA Dataset. We also demonstrate interpretability while examining different components in the inference pipeline.

show abstract

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Cited by 891 publications

References 24 publications

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Multimodal Learning For Hateful Memes Detection

STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering

Contact Info

Product

Resources

About