2021
DOI: 10.48550/arxiv.2105.14211
Preprint

M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis via Non-Autoregressive Generative Transformers

Abstract: Conditional image synthesis aims to create an image according to some multi-modal guidance in the forms of textual descriptions, reference images, and image blocks to preserve, as well as their combinations. In this paper, instead of investigating these control signals separately, we propose a new two-stage architecture, UFC-BERT, to unify any number of multi-modal controls. In UFC-BERT, both the diverse control signals and the synthesized image are uniformly represented as a sequence of discrete tokens to be …
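The central idea in the abstract, representing an arbitrary set of controls and the target image as one discrete token sequence, can be sketched roughly as follows. This is a minimal illustration only; the special-token ids and the assumption that each control arrives already tokenized (BPE ids, VQ codebook indices) are placeholders, not the paper's actual interface.

```python
# Rough sketch of unifying multi-modal controls as one discrete token sequence.
# `text_ids`, `ref_image_codes`, and `preserved_block_codes` are assumed to be
# already-tokenized inputs; the special-token ids below are hypothetical.

from typing import List, Optional

BOS, SEP, MASK = 0, 1, 2  # assumed special-token ids

def build_sequence(text_ids: Optional[List[int]] = None,
                   ref_image_codes: Optional[List[int]] = None,
                   preserved_block_codes: Optional[List[int]] = None,
                   num_target_tokens: int = 256) -> List[int]:
    """Concatenate whichever controls are present, then append a fully
    masked target-image segment for non-autoregressive generation."""
    seq = [BOS]
    for control in (text_ids, ref_image_codes, preserved_block_codes):
        if control:                       # any combination of controls is allowed
            seq += list(control) + [SEP]
    seq += [MASK] * num_target_tokens     # target-image tokens start masked
    return seq
```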

Cited by 10 publications (20 citation statements)
References 54 publications
“…Our framework for multimodal video generation is a two-stage image generation method that uses discrete feature representations [19,36,46,76]. During the first stage, we train an autoencoder (with encoder E and decoder D) that has the same architecture as the one from VQGAN [19] to obtain a quantized representation for images.…”
Section: Methods
confidence: 99%
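To make the first-stage quantization concrete, the following is a minimal sketch (assuming PyTorch, with illustrative names and shapes rather than the authors' code) of how a VQGAN-style encoder output is snapped to its nearest codebook entries, yielding the discrete image tokens the second stage operates on.

```python
import torch

def quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """
    z_e:      encoder output, shape (batch, num_patches, dim)
    codebook: learned embeddings, shape (vocab_size, dim)
    returns:  (quantized features, discrete token indices)
    """
    # squared L2 distance from every patch feature to every codebook entry
    dist = (z_e.unsqueeze(-2) - codebook).pow(2).sum(dim=-1)   # (batch, patches, vocab)
    indices = dist.argmin(dim=-1)                              # discrete image tokens
    z_q = codebook[indices]                                    # (batch, patches, dim)
    # straight-through estimator so gradients still reach the encoder
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices
```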
“…Unlike existing transformer-based video generation works that focus on autoregressive training, we apply a non-autoregressive generation pipeline with a bidirectional transformer [20,21,23,33,65]. Our work is inspired by M6-UFC [76], which utilizes non-autoregressive training for multimodal image generation and produces more diverse, higher-quality images. Building upon M6-UFC, we further introduce training techniques for multimodal video synthesis.…”
Section: Related Work
confidence: 99%
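The non-autoregressive, mask-predict style of decoding described above can be illustrated with the short sketch below. Here `model`, `MASK_ID`, and the linear re-masking schedule are assumptions made for illustration, not the interface of M6-UFC or of the citing work.

```python
# Iterative mask-predict decoding with a bidirectional transformer (sketch).
# `model(control_tokens, tokens)` is a hypothetical stand-in that returns
# per-position logits over image tokens for the partially masked sequence.

import torch

MASK_ID = 0  # assumed id of the [MASK] token

@torch.no_grad()
def mask_predict(model, control_tokens, num_image_tokens, steps=10):
    tokens = torch.full((1, num_image_tokens), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(control_tokens, tokens)           # (1, N, vocab)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                   # confidence and best token
        # fill only positions that are still masked; keep committed tokens
        still_masked = tokens.eq(MASK_ID)
        tokens = torch.where(still_masked, pred, tokens)
        conf = torch.where(still_masked, conf, torch.ones_like(conf))
        # re-mask the least confident positions, fewer with every iteration
        num_to_mask = int(num_image_tokens * (1.0 - (step + 1) / steps))
        if num_to_mask > 0:
            lowest = conf.topk(num_to_mask, largest=False).indices
            tokens[0, lowest[0]] = MASK_ID
    return tokens
```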
“…A possible solution is to discretize text, image, and object into a unified output vocabulary. Recent advances in image quantization [51,52] have demonstrated effectiveness in text-to-image synthesis [18,19,48,49], and thus we utilize this strategy for the target-side image representations. Sparse coding is effective in reducing the sequence length of the image representation.…”
Section: I/O and Architecture
confidence: 99%
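As a small illustration of the "unified output vocabulary" idea in the statement above, text tokens and quantized image tokens can share one id space by offsetting the image codebook indices past the text vocabulary. The vocabulary sizes and helper names below are assumptions for illustration, not values from the cited papers.

```python
# Fold text tokens and quantized image tokens into one output vocabulary by
# offsetting the image codebook indices past the text vocabulary (sketch).

TEXT_VOCAB_SIZE = 30_000      # assumed text (BPE) vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192   # assumed VQ codebook size
IMAGE_OFFSET = TEXT_VOCAB_SIZE

def to_unified(text_ids, image_codes):
    """Map both modalities into the shared output id space."""
    return list(text_ids) + [c + IMAGE_OFFSET for c in image_codes]

def split_unified(unified_ids):
    """Recover per-modality ids from the shared id space."""
    text_ids = [i for i in unified_ids if i < IMAGE_OFFSET]
    image_codes = [i - IMAGE_OFFSET for i in unified_ids if i >= IMAGE_OFFSET]
    return text_ids, image_codes
```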