Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.707
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Abstract: Mirroring the success of masked language models, vision-and-language counterparts like VILBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This raises the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this mode…

Cited by 56 publications (52 citation statements)
References 45 publications
“…These included, for example, object recognition and thus the link between visual and textual representations. The semantic analysis of images or videos is still an active research topic today [33,34]. Image GPT is a current example of how a system can be taught identifiers, also called labels, from images.…”
Section: Die Enkel Locards - Quo Vadis? (unclassified)
“…The Transformer accepts a sequence of image and text representations as inputs, encodes them into contextualized vector representations, and outputs image and text tokens. For text-to-image generation, we follow X-LXMERT [2] and use a GAN-based image generator to convert the image tokens into a real scene image.…”
Section: Approach 2.1 Pipeline (mentioning)
confidence: 99%
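The pipeline quoted above is a two-stage design: a bidirectional transformer predicts a grid of discrete visual tokens from text, and a separately trained GAN-based generator renders an image from those tokens. Below is a minimal sketch of the first stage only; the module names, layer sizes, codebook size, and 8x8 grid resolution are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch (not the authors' code) of an X-LXMERT-style text-to-visual-token
# predictor: text tokens plus learned grid queries pass through a transformer,
# and each grid cell is classified over a discrete visual codebook.
import torch
import torch.nn as nn

VOCAB_SIZE = 30522        # text vocabulary size (BERT-sized, assumed)
NUM_VISUAL_CODES = 10000  # size of the clustered visual codebook (assumed)
GRID = 8                  # 8x8 grid of visual tokens (assumed resolution)
D = 768

class TextToVisualTokens(nn.Module):
    """Bidirectional transformer mapping a caption to a grid of visual codes."""
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB_SIZE, D)
        # One learned query per grid cell, standing in for masked visual positions.
        self.grid_query = nn.Parameter(torch.randn(GRID * GRID, D))
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.to_code = nn.Linear(D, NUM_VISUAL_CODES)  # logits over the codebook

    def forward(self, text_ids):
        b = text_ids.size(0)
        text = self.text_emb(text_ids)                             # (b, T, D)
        queries = self.grid_query.unsqueeze(0).expand(b, -1, -1)   # (b, 64, D)
        h = self.encoder(torch.cat([text, queries], dim=1))        # joint encoding
        grid_h = h[:, -GRID * GRID:, :]                            # grid positions only
        return self.to_code(grid_h)                                # (b, 64, codes)

model = TextToVisualTokens()
logits = model(torch.randint(0, VOCAB_SIZE, (2, 16)))  # dummy caption token ids
visual_tokens = logits.argmax(dim=-1)                  # (2, 64) discrete codes
# A GAN-based image generator (trained separately, not shown here) would map
# these codes back to cluster-centroid features and render the scene image.
print(visual_tokens.shape)
```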
“…We use the original grid features as visual inputs for image-to-text generation tasks to reduce the loss of image information. We use discretely clustered versions of the original features to construct the ground-truth visual tokens that serve as the output prediction for text-to-image generation [2].…”
Section: Image-and-Text Representations (mentioning)
confidence: 99%
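The statement above describes how the ground-truth visual tokens are built: dense grid features are clustered, and each grid cell is assigned the index of its nearest centroid. A hedged sketch of that discretization step follows, assuming k-means over CNN grid features; the feature dimensionality and codebook size here are illustrative, not the paper's exact values.

```python
# Sketch of turning continuous grid features into discrete visual tokens
# via k-means clustering (the quoted "discrete clustering" step).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for grid-cell features pooled from many training images: (N, 2048).
grid_features = rng.normal(size=(2000, 2048)).astype(np.float32)

codebook_size = 512  # assumed; the actual codebook may be much larger
kmeans = KMeans(n_clusters=codebook_size, n_init=4, random_state=0)
kmeans.fit(grid_features)

# For a single image, its 8x8 = 64 grid cells become 64 discrete token ids,
# which the transformer is trained to predict for text-to-image generation.
one_image_grid = rng.normal(size=(64, 2048)).astype(np.float32)
visual_token_ids = kmeans.predict(one_image_grid)  # shape (64,), ids in [0, 511]
print(visual_token_ids[:8])
```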
“…(1) Task-specific uni-directional architectures [2,6,53] for bi-directional image and text generation tasks. Our task-agnostic bi-directional architecture, as shown in (2), removes the architecture-design effort required by the task-specific models in (1).…”
mentioning
confidence: 99%