LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Preprint, 2019
DOI: 10.48550/arxiv.1908.07490

Cited by 223 publications (365 citation statements)
References 26 publications
“…Various transformer-based VQA models [Su et al., 2019, Li et al., 2019b,a, Zhou et al., 2019, Chefer et al., 2021] have been introduced in the last few years. Among them, [Tan and Bansal, 2019] and are two-stream transformer architectures that use cross-attention layers and co-attention layers, respectively, to allow information exchange across modalities. There are several studies on the interpretability of VQA models [Goyal et al., 2016, Kafle and Kanan, 2017, Jabri et al., 2016], and yet very few have focused on the co-attention transformer layers used in recent VQA models.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
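As background on the two-stream design mentioned in the excerpt above, the sketch below shows one way a bidirectional cross-attention layer can exchange information between language and visual features. It is an illustrative PyTorch sketch under assumed names and dimensions (CrossModalityAttention, 768-dimensional features, 36 visual regions), not the implementation of LXMERT or any other cited model.

```python
# Minimal sketch of bidirectional cross-modality attention (illustrative only).
import torch
import torch.nn as nn

class CrossModalityAttention(nn.Module):
    def __init__(self, hidden_dim=768, num_heads=12):
        super().__init__()
        # One attention module per direction: language -> vision and vision -> language.
        self.lang_to_vis = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, lang_feats, vis_feats):
        # Language tokens query the visual regions (keys/values), and visual
        # regions query the language tokens, exchanging information across modalities.
        lang_out, _ = self.lang_to_vis(query=lang_feats, key=vis_feats, value=vis_feats)
        vis_out, _ = self.vis_to_lang(query=vis_feats, key=lang_feats, value=lang_feats)
        return lang_out, vis_out

# Toy usage: a batch of 20 language tokens and 36 visual region features.
lang = torch.randn(2, 20, 768)
vis = torch.randn(2, 36, 768)
lang_ctx, vis_ctx = CrossModalityAttention()(lang, vis)
```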
“…(i.e., demonstrative term in the sentence) For example, when translating the sentence "The animal didn't cross the street because it was too tired," it would be helpful to know which word "it" refers to, as this would greatly improve the translation result. Due to its self-attention property, the Transformer architecture has been applied to tasks beyond image captioning [24] and visual question answering [34]; it also appears in work on vision-and-language navigation [11] and video understanding [33]. Furthermore, the Transformer architecture helps the model learn cross-modal representations from a concatenated sequence of visual region features and language token embeddings [18,32].…”
Section: Transformers in the Multi-modal Task (citation type: mentioning; confidence: 99%)
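The single-stream setup described in the excerpt above, where visual region features are projected into the language embedding space and concatenated with token embeddings before a shared Transformer encoder, can be sketched as follows. The dimensions, vocabulary size, and layer counts are assumptions for illustration, not values taken from the cited works.

```python
# Minimal sketch of a single-stream (concatenated) multimodal encoder (illustrative only).
import torch
import torch.nn as nn

hidden_dim = 768
vis_feat_dim = 2048                              # e.g. detector region features (assumption)
vis_proj = nn.Linear(vis_feat_dim, hidden_dim)   # project regions into the embedding space
token_emb = nn.Embedding(30522, hidden_dim)      # BERT-sized vocabulary (assumption)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True),
    num_layers=6,
)

token_ids = torch.randint(0, 30522, (2, 20))     # language tokens
region_feats = torch.randn(2, 36, vis_feat_dim)  # detected visual regions

# Concatenate token embeddings and projected region features into one sequence,
# so self-attention mixes the two modalities jointly.
joint_seq = torch.cat([token_emb(token_ids), vis_proj(region_feats)], dim=1)
cross_modal_repr = encoder(joint_seq)            # shape: (2, 20 + 36, 768)
```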
“…LCGN [13], NSM [15], and LRTA [22] mainly focus on solving complicated visual questions by first constructing graphs that represent the underlying semantics. LXMERT [34], Oscar [21], and VinVL [43] are Transformer-based models that solve the vision-language problem by pre-training the model to align visual concepts with the corresponding concepts in the text modality. The table shows that our model outperforms the state-of-the-art methods, even those that need pre-training.…”
Section: Overall Performance (citation type: mentioning; confidence: 99%)
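One common way such visual-text alignment is learned during pre-training is an image-text matching objective: a pooled cross-modal representation is classified as a matched or mismatched image/sentence pair. The snippet below is a minimal, hypothetical sketch of that idea and does not reproduce the pre-training code of LXMERT, Oscar, or VinVL.

```python
# Minimal sketch of an image-text matching pre-training head (illustrative only).
import torch
import torch.nn as nn

hidden_dim = 768
itm_head = nn.Linear(hidden_dim, 2)     # logits for matched / mismatched pairs
criterion = nn.CrossEntropyLoss()

pooled = torch.randn(4, hidden_dim)     # pooled cross-modal features (placeholder)
labels = torch.tensor([1, 0, 1, 0])     # 1 = image and sentence belong together
loss = criterion(itm_head(pooled), labels)
```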
“…Multiple VQA datasets have been proposed, such as Visual Genome QA [25], VQA [2], GQA [16], CLEVR [22], MovieQA [53], and so on. Many works have shown state-of-the-art performance on VQA tasks, including task-specific VQA models with various cross-modality fusion mechanisms [13,20,24,49,62,66,67] and joint vision-language models that are pretrained on large-scale vision-language corpora and fine-tuned on VQA tasks [6,11,29,30,33,52,68]. Note that the conventional VQA task does not require external knowledge by definition, although studies show that some VQA questions may require commonsense knowledge to answer correctly [2].…”
Section: Related Work (citation type: mentioning; confidence: 99%)