2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00360

Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Cited by 26 publications (49 citation statements, all classified as mentioning). References 24 publications.

“…To improve the decoding efficiency of autoregressive transformers, bidirectional generative transformers have been proposed [5,13,56]. In contrast to autoregressive models, which predict a single token at each step, a bidirectional transformer learns to predict multiple masked tokens at once based on the previously generated context.…”
Section: Bidirectional Transformers (mentioning)
Confidence: 99%
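As a concrete illustration of the mechanism this statement describes, here is a minimal PyTorch sketch of bidirectional masked-token prediction; the model, names, and sizes (BidirectionalTokenPredictor, VOCAB_SIZE, and so on) are invented for illustration and are not taken from the cited papers.

```python
import torch
import torch.nn as nn

# Toy setup: a vocabulary of discrete video tokens plus one extra [MASK] id.
# All names and sizes here are illustrative, not from the cited papers.
VOCAB_SIZE, MASK_ID, SEQ_LEN, DIM = 1024, 1024, 64, 256

class BidirectionalTokenPredictor(nn.Module):
    """Encoder without a causal mask: every position attends to all others."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE + 1, DIM)  # +1 for [MASK]
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB_SIZE)

    def forward(self, tokens):                  # tokens: (B, SEQ_LEN) ids
        return self.head(self.encoder(self.embed(tokens)))

model = BidirectionalTokenPredictor()
tokens = torch.randint(0, VOCAB_SIZE, (1, SEQ_LEN))
masked = torch.rand(1, SEQ_LEN) < 0.5           # mask a random subset
inputs = tokens.masked_fill(masked, MASK_ID)
logits = model(inputs)                          # (1, SEQ_LEN, VOCAB_SIZE)
# Unlike autoregressive training, the loss covers all masked positions at once.
loss = nn.functional.cross_entropy(logits[masked], tokens[masked])
```

Because no causal mask is used, a single forward pass scores every masked position simultaneously, which is what makes the faster decoding discussed above possible.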
“…To (partially) address these issues, prior works proposed improved transformers for generative modeling of videos, categorized as follows: (a) employing sparse attention to improve scaling during training [12,16,51], (b) hierarchical approaches that employ separate models at different frame rates to generate long videos with a smaller computation budget [11,16], and (c) removing autoregression by formulating the generative process as masked token prediction and training a bidirectional transformer [12,13]. While each approach is effective at addressing specific limitations of autoregressive transformers, none provides a comprehensive solution to the aforementioned problems: (a) and (b) still inherit the problems of autoregressive inference and, by design, cannot leverage long-term dependencies due to the local attention window, and (c) is ill-suited to learning long-range dependencies due to the quadratic computation cost.…”
Section: Introduction (mentioning)
Confidence: 99%
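For intuition on point (c), the payoff of masked token prediction at inference time is that decoding takes a small, fixed number of parallel refinement rounds rather than one forward pass per token. The sketch below, reusing the hypothetical predictor and MASK_ID from the previous example, follows a MaskGIT-style confidence schedule; it illustrates the general technique, not any cited paper's exact procedure.

```python
import torch

@torch.no_grad()
def parallel_decode(model, seq_len=64, steps=8):
    """MaskGIT-style iterative decoding: a few parallel refinement rounds
    instead of one sequential step per token (illustrative schedule)."""
    tokens = torch.full((1, seq_len), MASK_ID)          # start fully masked
    for step in range(steps):
        probs = model(tokens).softmax(-1)
        conf, pred = probs.max(-1)                      # per-position confidence
        conf = conf.masked_fill(tokens != MASK_ID, 1.0) # keep committed tokens
        keep = int(seq_len * (step + 1) / steps)        # commit more each round
        idx = conf.topk(keep, dim=-1).indices
        fresh = torch.full_like(tokens, MASK_ID)
        fresh.scatter_(1, idx, pred.gather(1, idx))
        tokens = torch.where(tokens == MASK_ID, fresh, tokens)
    return tokens                                       # all positions filled

video_tokens = parallel_decode(model)   # 8 forward passes vs. 64 sequential
```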
“…In most existing GAN-based FAM methods, the target attributes are specified by labels [31], [32], [79] or exemplar images [39], [68], [80]. Recently, conditional information in other modalities, such as text [198]–[202] and speech [203]–[205], has attracted increasing research attention due to the development of large-scale pre-trained frameworks (e.g., CLIP [206]) and the availability of related datasets (e.g., CelebA-Dialog [207]). Moreover, novel modalities of supervision signal, such as biometrics (e.g., brain responses recorded via electroencephalography [208]) and sound [209], have also been utilized to learn feature representations for semantic editing.…”
Section: Challenges and Future Directions (mentioning)
Confidence: 99%
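To make the text-conditioning route concrete: a pre-trained CLIP text encoder maps a prompt to an embedding that an editing model can consume. The sketch below uses the Hugging Face transformers API for CLIP; the downstream injection of `cond` into a FAM model is hypothetical and not part of the cited works.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Load the public CLIP text encoder (assumes the transformers library
# is installed and the weights are downloadable or cached).
name = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_encoder = CLIPTextModel.from_pretrained(name)

with torch.no_grad():
    batch = tokenizer(["a smiling face with blond hair"],
                      padding=True, return_tensors="pt")
    out = text_encoder(**batch)

# pooler_output: one embedding per prompt; a FAM model would inject this
# vector (or the per-token hidden states) as its conditioning signal.
cond = out.pooler_output        # shape: (1, 512)
```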
“…To model the logic of games and the game AI that determine the evolution of the environment states, we introduce an animation model. Specifically, inspired by [Han et al. 2022], we train a non-autoregressive, text-conditioned diffusion model that leverages masked sequence modeling to enable the fine-grained conditioning capabilities on which the above-mentioned applications are based. In particular, we show that using text labels describing actions happening in a game is instrumental in learning such capabilities.…”
Section: Introduction (mentioning)
Confidence: 99%
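A minimal sketch of the masked-sequence-conditioning idea described here, with all sizes and names invented for illustration: a boolean mask marks which frame latents are given as context, and the model only generates the masked remainder, so one network covers prediction, interpolation, and unconditional generation.

```python
import torch

# Toy sizes and names, invented for illustration.
T, D = 16, 32                                 # frames, latent dimension
frames = torch.randn(1, T, D)                 # encoded frame latents
given = torch.zeros(1, T, dtype=torch.bool)
given[:, :4] = True                           # condition on the first 4 frames

mask_token = torch.zeros(D)                   # learned in practice; fixed here
inputs = torch.where(given.unsqueeze(-1), frames, mask_token)
# `inputs`, together with a text embedding of the action label, is what the
# denoising network would see; the training loss applies only where `given`
# is False. Changing `given` switches between prediction, interpolation,
# and unconditional generation without retraining.
```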