Various transformer-based VQA models [Su et al., 2019, Li et al., 2019a,b, Zhou et al., 2019, Chefer et al., 2021] have been introduced in the last few years. Among them, [Tan and Bansal, 2019] and [Lu et al., 2019] are two-stream transformer architectures that use cross-attention layers and co-attention layers, respectively, to allow information exchange across modalities. There are several studies on the interpretability of VQA models [Goyal et al., 2016, Kafle and Kanan, 2017, Jabri et al., 2016], and yet very few have focused on the co-attention transformer layers used in recent VQA models.
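Concretely, a co-attention layer can be sketched as scaled dot-product attention in which the queries come from one modality while the keys and values come from the other; the notation below is ours, a schematic rather than the exact formulation of any of the cited models:

\[
\mathrm{CoAttn}(X_v, X_t) \;=\; \mathrm{softmax}\!\left(\frac{(X_v W_Q)(X_t W_K)^{\top}}{\sqrt{d}}\right) X_t W_V,
\]

where \(X_v\) and \(X_t\) denote the visual and textual token representations, \(W_Q, W_K, W_V\) are learned projection matrices, and \(d\) is the key dimension. A symmetric layer lets the textual stream attend to the visual one, so the two streams exchange information at every such layer.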