2020
DOI: 10.48550/arxiv.2009.11278
Preprint

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Abstract: Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this mode…

Cited by 16 publications (15 citation statements)
References 40 publications
“…There has since been a surge of interest in training multi-stage attention based GAN architectures for this task. While the conventional setting (Zhang et al, 2017;Li et al, 2019a;Zhu et al, 2019) assumes only the availability of (text,image) pairs at training time, recently a second setting has emerged that assumes availability of bounding-box/shape-mask information of objects attributes during training (Li et al, 2019b;Hinz et al, 2019;Cho et al, 2020;Liang et al, 2020). We highlight that this represents a significantly easier problem setting and that such methods are not feasible where bounding-box/shape information is unavailable (such as the CUB dataset).…”
Section: Related Work (mentioning)
confidence: 98%
“…With great progress in language tasks [55,44,45,2], the transformer architecture is being rapidly transferred to other fields such as vision [3,12] and audio [6]. Recently, pretraining visual-language transformer [43,24,68,7,35,53] (e.g. multi-modal BERT) has achieved significant improvements on a variety of downstream tasks, e.g.…”
Section: Related Work (mentioning)
confidence: 99%
“…We train UFC-BERT via the masked sequence modeling task, which predicts a masked subset of the target image's tokens conditioned on both the multi-modal control signals and the generation target's unmasked tokens. During inference, we adopt Mask-Predict, a NAR generation algorithm [16,21,7], which predicts all target tokens at the first iteration and then iteratively re-mask and re-predict a subset of tokens with low confidence scores. To further improve upon the NAR generation algorithm, we exploit the discriminative capability of the BERT architecture [11,68] and add two estimators (see Figure 2), where one estimator estimates the relevance between the generated image and the control signals, and the other one estimates the image's fidelity.…”
Section: Introduction (mentioning)
confidence: 99%
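The Mask-Predict style decoding loop described in the statement above can be sketched in a few lines. The snippet below is a minimal illustration only: it assumes a hypothetical model(tokens, control) interface that returns per-position logits over a discrete image-token vocabulary, and uses a simple linear re-masking schedule; it is not the UFC-BERT implementation and omits its relevance and fidelity estimators.

```python
import torch

def mask_predict(model, control, seq_len, mask_id, num_iters=10):
    """Minimal Mask-Predict-style non-autoregressive decoding loop:
    predict every target token at once, then iteratively re-mask the
    lowest-confidence positions and re-predict them."""
    # Start from a fully masked target sequence of discrete image tokens.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    scores = torch.zeros(1, seq_len)

    for it in range(num_iters):
        # Hypothetical interface: the model returns per-position logits over
        # the image-token vocabulary, conditioned on the control signals.
        logits = model(tokens, control)            # (1, seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)             # per-position confidence / argmax

        # Only positions that are currently masked get (re-)predicted.
        masked = tokens.eq(mask_id)
        tokens = torch.where(masked, pred, tokens)
        scores = torch.where(masked, conf, scores)

        if it == num_iters - 1:
            break

        # Linear schedule: re-mask a shrinking number of low-confidence tokens.
        n_mask = int(seq_len * (num_iters - 1 - it) / num_iters)
        if n_mask == 0:
            break
        remask = scores.topk(n_mask, largest=False).indices
        tokens.scatter_(1, remask, mask_id)
        scores.scatter_(1, remask, 0.0)

    return tokens
```

At each iteration, positions kept from earlier iterations retain their confidence scores, so the re-masked subset always targets the tokens the model is least certain about, matching the re-mask/re-predict behaviour described in the citing paper.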
“…Recently there has been interest in the generative modeling community in reconstructing videos from lower bitrate alternatives such as text or low dimensional latent spaces [23,26,42]. While there has been significant progress on using generative machine learning to model natural images from text [9,27,31], these approaches are currently unable to produce high-quality videos. To recreate webcam video data, 2D [30,33,40,44,45] or 3D graphics based methods [14,17,34] have been used successfully to generate realistic talking-head videos.…”
Section: Introduction (mentioning)
confidence: 99%