2021
DOI: 10.48550/arxiv.2105.14211
Preprint

M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis via Non-Autoregressive Generative Transformers

Abstract: Conditional image synthesis aims to create an image according to some multi-modal guidance in the forms of textual descriptions, reference images, and image blocks to preserve, as well as their combinations. In this paper, instead of investigating these control signals separately, we propose a new two-stage architecture, UFC-BERT, to unify any number of multi-modal controls. In UFC-BERT, both the diverse control signals and the synthesized image are uniformly represented as a sequence of discrete tokens to be …
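The central idea in the abstract, representing an arbitrary set of controls and the target image as one discrete token sequence, can be sketched roughly as follows. This is a minimal illustration only; the special-token ids and the assumption that each control arrives already tokenized (BPE ids, VQ codebook indices) are placeholders, not the paper's actual interface.

```python
# Rough sketch of unifying multi-modal controls as one discrete token sequence.
# `text_ids`, `ref_image_codes`, and `preserved_block_codes` are assumed to be
# already-tokenized inputs; the special-token ids below are hypothetical.

from typing import List, Optional

BOS, SEP, MASK = 0, 1, 2  # assumed special-token ids

def build_sequence(text_ids: Optional[List[int]] = None,
                   ref_image_codes: Optional[List[int]] = None,
                   preserved_block_codes: Optional[List[int]] = None,
                   num_target_tokens: int = 256) -> List[int]:
    """Concatenate whichever controls are present, then append a fully
    masked target-image segment for non-autoregressive generation."""
    seq = [BOS]
    for control in (text_ids, ref_image_codes, preserved_block_codes):
        if control:                       # any combination of controls is allowed
            seq += list(control) + [SEP]
    seq += [MASK] * num_target_tokens     # target-image tokens start masked
    return seq
```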

Cited by 10 publications (20 citation statements)
References 54 publications
“…Our framework for multimodal video generation is a two-stage image generation method that uses discrete feature representations [19,36,46,76]. During the first stage, we train an autoencoder (with encoder E and decoder D) that has the same architecture as the one from VQGAN [19] to obtain a quantized representation for images.…”
Section: Methods
confidence: 99%
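To make the first-stage quantization concrete, the following is a minimal sketch (assuming PyTorch, with illustrative names and shapes rather than the authors' code) of how a VQGAN-style encoder output is snapped to its nearest codebook entries, yielding the discrete image tokens the second stage operates on.

```python
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """
    z_e:      encoder output, shape (batch, num_patches, dim)
    codebook: learned embeddings, shape (vocab_size, dim)
    returns:  (quantized features, discrete token indices)
    """
    # squared L2 distance from every patch feature to every codebook entry
    dist = (z_e.unsqueeze(-2) - codebook).pow(2).sum(dim=-1)   # (batch, patches, vocab)
    indices = dist.argmin(dim=-1)                              # discrete image tokens
    z_q = codebook[indices]                                    # (batch, patches, dim)
    # straight-through estimator so gradients still reach the encoder
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices
```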
“…Unlike existing transformer-based video generation works that focus on autoregressive training, we apply a non-autoregressive generation pipeline with a bidirectional transformer [20,21,23,33,65]. Our work is inspired by M6-UFC [76], which utilizes non-autoregressive training for multimodal image generation and produces more diverse, higher-quality images. Building upon M6-UFC, we further introduce training techniques for multimodal video synthesis.…”
Section: Related Work
confidence: 99%
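The non-autoregressive, mask-predict style of decoding described above can be illustrated with the short sketch below. Here `model`, `MASK_ID`, and the linear re-masking schedule are assumptions made for illustration, not the interface of M6-UFC or of the citing work.

```python
# Iterative mask-predict decoding with a bidirectional transformer (sketch).
# `model(control_tokens, tokens)` is a hypothetical stand-in that returns
# per-position logits over image tokens for the partially masked sequence.

import torch

MASK_ID = 0  # assumed id of the [MASK] token

@torch.no_grad()
def mask_predict(model, control_tokens, num_image_tokens, steps=10):
    tokens = torch.full((1, num_image_tokens), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(control_tokens, tokens)           # (1, N, vocab)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                   # confidence and best token
        # fill only positions that are still masked; keep committed tokens
        still_masked = tokens.eq(MASK_ID)
        tokens = torch.where(still_masked, pred, tokens)
        conf = torch.where(still_masked, conf, torch.ones_like(conf))
        # re-mask the least confident positions, fewer with every iteration
        num_to_mask = int(num_image_tokens * (1.0 - (step + 1) / steps))
        if num_to_mask > 0:
            lowest = conf.topk(num_to_mask, largest=False).indices
            tokens[0, lowest[0]] = MASK_ID
    return tokens
```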
“…A possible solution is to discretize text, image, and object into a unified output vocabulary. Recent advances in image quantization [51,52] have demonstrated effectiveness in text-to-image synthesis [18,19,48,49], and thus we utilize this strategy for the target-side image representations. Sparse coding is effective in reducing the sequence length of the image representation.…”
Section: I/O and Architecture
confidence: 99%
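As a small illustration of the "unified output vocabulary" idea in the statement above, text tokens and quantized image tokens can share one id space by offsetting the image codebook indices past the text vocabulary. The vocabulary sizes and helper names below are assumptions for illustration, not values from the cited papers.

```python
# Fold text tokens and quantized image tokens into one output vocabulary by
# offsetting the image codebook indices past the text vocabulary (sketch).

TEXT_VOCAB_SIZE = 30_000      # assumed text (BPE) vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192   # assumed VQ codebook size
IMAGE_OFFSET = TEXT_VOCAB_SIZE

def to_unified(text_ids, image_codes):
    """Map both modalities into the shared output id space."""
    return list(text_ids) + [c + IMAGE_OFFSET for c in image_codes]

def split_unified(unified_ids):
    """Recover per-modality ids from the shared id space."""
    text_ids = [i for i in unified_ids if i < IMAGE_OFFSET]
    image_codes = [i - IMAGE_OFFSET for i in unified_ids if i >= IMAGE_OFFSET]
    return text_ids, image_codes
```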