Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.707
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Abstract: Mirroring the success of masked language models, vision-and-language counterparts like VILBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This raises the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this mode…

Cited by 56 publications (52 citation statements)
References 45 publications
“…These included, for example, object recognition and thus the link between visual and textual representations. The semantic analysis of images or videos is still an active research topic today [33,34]. Image GPT is a current example of how a system can be taught identifiers, also called labels, from images.…”
Section: Die Enkel Locards - Quo Vadis? (unclassified)
“…The Transformer accepts a sequence of image and text representations as inputs, encodes them into contextualized vector representations, and outputs image and text tokens. For text-to-image generation, we follow X-LXMERT [2] and use a GAN-based image generator to convert the image tokens into a real scene image.…”
Section: Approach 2.1 Pipeline (mentioning)
confidence: 99%
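The pipeline quoted above is a two-stage design: a bidirectional transformer predicts a grid of discrete visual tokens from text, and a separately trained GAN-based generator renders an image from those tokens. Below is a minimal sketch of the first stage only; the module names, layer sizes, codebook size, and 8x8 grid resolution are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch (not the authors' code) of an X-LXMERT-style text-to-visual-token
# predictor: text tokens plus learned grid queries pass through a transformer,
# and each grid cell is classified over a discrete visual codebook.
import torch
import torch.nn as nn

VOCAB_SIZE = 30522        # text vocabulary size (BERT-sized, assumed)
NUM_VISUAL_CODES = 10000  # size of the clustered visual codebook (assumed)
GRID = 8                  # 8x8 grid of visual tokens (assumed resolution)
D = 768

class TextToVisualTokens(nn.Module):
    """Bidirectional transformer mapping a caption to a grid of visual codes."""
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB_SIZE, D)
        # One learned query per grid cell, standing in for masked visual positions.
        self.grid_query = nn.Parameter(torch.randn(GRID * GRID, D))
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.to_code = nn.Linear(D, NUM_VISUAL_CODES)  # logits over the codebook

    def forward(self, text_ids):
        b = text_ids.size(0)
        text = self.text_emb(text_ids)                             # (b, T, D)
        queries = self.grid_query.unsqueeze(0).expand(b, -1, -1)   # (b, 64, D)
        h = self.encoder(torch.cat([text, queries], dim=1))        # joint encoding
        grid_h = h[:, -GRID * GRID:, :]                            # grid positions only
        return self.to_code(grid_h)                                # (b, 64, codes)

model = TextToVisualTokens()
logits = model(torch.randint(0, VOCAB_SIZE, (2, 16)))  # dummy caption token ids
visual_tokens = logits.argmax(dim=-1)                  # (2, 64) discrete codes
# A GAN-based image generator (trained separately, not shown here) would map
# these codes back to cluster-centroid features and render the scene image.
print(visual_tokens.shape)
```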
“…We use the original grid features as visual inputs for image-to-text generation tasks to reduce the loss of image information. We use discretely clustered versions of the original features to construct the ground-truth visual tokens that serve as the output prediction for text-to-image generation [2].…”
Section: Image-and-Text Representations (mentioning)
confidence: 99%
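The statement above describes how the ground-truth visual tokens are built: dense grid features are clustered, and each grid cell is assigned the index of its nearest centroid. A hedged sketch of that discretization step follows, assuming k-means over CNN grid features; the feature dimensionality and codebook size here are illustrative, not the paper's exact values.

```python
# Sketch of turning continuous grid features into discrete visual tokens
# via k-means clustering (the quoted "discrete clustering" step).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for grid-cell features pooled from many training images: (N, 2048).
grid_features = rng.normal(size=(2000, 2048)).astype(np.float32)

codebook_size = 512  # assumed; the actual codebook may be much larger
kmeans = KMeans(n_clusters=codebook_size, n_init=4, random_state=0)
kmeans.fit(grid_features)

# For a single image, its 8x8 = 64 grid cells become 64 discrete token ids,
# which the transformer is trained to predict for text-to-image generation.
one_image_grid = rng.normal(size=(64, 2048)).astype(np.float32)
visual_token_ids = kmeans.predict(one_image_grid)  # shape (64,), ids in [0, 511]
print(visual_token_ids[:8])
```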
“…(1) Task-specific uni-directional architectures [2,6,53] for bi-directional image and text generation tasks. Our task-agnostic bi-directional architecture, as shown in (2), removes the architecture-design effort required by the task-specific models in (1).…”
mentioning
confidence: 99%