A Corpus for Reasoning About Natural Language Grounded in Photographs
2018 · Preprint
DOI: 10.48550/arxiv.1811.00491

Cited by 42 publications (65 citation statements)
References 0 publications
“…GQA [18] and NLVR2 [44], as considered in the LXMERT paper. With UNITER-backbone for DCVLP, the pre-training data is the same as for UNITER.…”
Section: Methods (mentioning)
confidence: 99%
“…During inference, we constrain the decoder to only generate from the 3,192 candidate answers to make a fair comparison with existing methods. Natural Language for Visual Reasoning (NLVR2) (Suhr et al., 2018): since the task asks the model to distinguish whether a text describes a pair of images, we follow ALBEF to extend the cross-modal encoder to enable reasoning over two images. We also perform an additional pre-training step for 1 epoch using the 4M images: given a pair of images and a text, the model needs to assign the text to either the first image, the second image, or none of them.…”
Section: B Implementation Details of Downstream Tasks (mentioning)
confidence: 99%
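The excerpt above describes an image-text assignment pre-training step for NLVR2: given two images and a caption, the model predicts which image, if any, the caption describes. Below is a minimal PyTorch sketch of that three-way objective, not the authors' code; the `encoder` argument and its `(image, text_tokens)` call signature are hypothetical stand-ins for an ALBEF-style fusion encoder that returns a pooled multimodal embedding.

```python
# Hedged sketch (assumed, not the cited authors' implementation) of the
# image-text assignment task: caption matches image 1, image 2, or neither.
import torch
import torch.nn as nn


class ImageTextAssignmentHead(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder  # hypothetical cross-modal fusion encoder
        # Three-way classifier: first image / second image / neither.
        self.classifier = nn.Linear(2 * hidden_dim, 3)

    def forward(self, image1, image2, text_tokens):
        # Fuse the caption with each image separately, then concatenate
        # the two pooled representations for the 3-way decision.
        fused1 = self.encoder(image1, text_tokens)  # (batch, hidden_dim)
        fused2 = self.encoder(image2, text_tokens)  # (batch, hidden_dim)
        logits = self.classifier(torch.cat([fused1, fused2], dim=-1))
        return logits  # (batch, 3)


# Usage: labels are 0 (first image), 1 (second image), or 2 (neither);
# the extra pre-training epoch would minimize a standard cross-entropy:
# loss = nn.CrossEntropyLoss()(logits, labels)
```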
“…Image patch and text token embeddings are fed into a transformer or self-attention model to learn fused cross-modal attention. The great progress of these recently developed models can be witnessed on the leaderboards of various tasks, without using ensembling, such as VQA, GQA [37], and NLVR2 [38], which can mainly be attributed to the availability of large-scale, weakly correlated multimodal data (typically captioned images or video clips and accompanying subtitles [39]) that can be utilised to learn cross-modal representations by contrastive learning [40]. However, existing pre-trained models use mostly scene-limited image-text pairs with short and relatively simple descriptive captions for images, while ignoring richer uni-modal text data and domain-specific information.…”
Section: Related Work (mentioning)
confidence: 99%
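The excerpt above mentions learning cross-modal representations from weakly correlated image-caption data via contrastive learning. The sketch below shows one common form of that objective, a symmetric InfoNCE loss over a batch of image-caption pairs; it is an illustrative assumption rather than the specific loss of the works cited in [40], and the inputs are assumed to be image and text embeddings already projected to a shared dimension.

```python
# Hedged sketch of a CLIP/ALBEF-style image-text contrastive objective.
import torch
import torch.nn.functional as F


def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) embeddings of paired image-caption data."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity of every image with every caption in the batch.
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: match images to their captions and captions to their images.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```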