Yuejian Fang scite author profile

We propose Unicoder-VL, a universal encoder that learns joint representations of vision and language through pre-training. Borrowing ideas from cross-lingual pre-trained models such as XLM (Lample and Conneau 2019) and Unicoder (Huang et al. 2019), we feed both visual and linguistic content into a multi-layer Transformer (Vaswani et al. 2017) for cross-modal pre-training, using three pre-training tasks: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual content jointly. The last task predicts whether an image and a text describe each other. After pre-training on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning with just one additional output layer. We achieve state-of-the-art or comparable results on both tasks, demonstrating the effectiveness of cross-modal pre-training.
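As a rough illustration of how training examples for two of the tasks named in the abstract might be prepared, here is a minimal Python sketch. It is not the authors' implementation. The function names, the `[MASK]` placeholder, the default 15% masking rate, and the one-negative-per-positive sampling scheme are all assumptions made for illustration. Masked Object Classification would presumably apply a similar masking step to image-region features, which is not shown here.

```python
import random

MASK = "[MASK]"  # assumed placeholder token, not taken from the abstract


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Masked Language Modeling sketch.

    Replaces each token with MASK with probability mask_prob (an assumed
    default). Returns (masked_tokens, labels), where labels holds the
    original token at masked positions and None elsewhere.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def make_matching_pairs(pairs, rng=None):
    """Visual-linguistic Matching sketch.

    For each (image_id, caption) pair, emits one positive example
    (label 1) and one negative example (label 0) whose caption is drawn
    from a different image. Requires at least two pairs.
    """
    rng = rng or random.Random()
    examples = []
    for i, (image_id, caption) in enumerate(pairs):
        examples.append((image_id, caption, 1))
        # Pick a caption index other than i for the mismatched example.
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        examples.append((image_id, pairs[j][1], 0))
    return examples
```

In this sketch a model would be trained to recover the non-`None` labels from the masked sequence and to predict the 0/1 label for each image-caption example. The actual sampling details of Unicoder-VL may differ.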


Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Li¹, Duan², Fang³ et al. 2019

Preprint


NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

Liang et al. 2022


NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

Wu¹, Ji², Ji³ et al. 2021

Preprint


