2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00360

Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Cited by 26 publications (49 citation statements, all classified as mentioning). References 24 publications.

“…To improve the decoding efficiency of autoregressive transformers, bidirectional generative transformers have been proposed [5,13,56]. In contrast to autoregressive models, which predict a single token at each step, a bidirectional transformer learns to predict multiple masked tokens at once based on the previously generated context.…”
Section: Bidirectional Transformers (mentioning)
Confidence: 99%
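As a concrete illustration of the mechanism this statement describes, here is a minimal PyTorch sketch of bidirectional masked-token prediction; the model, names, and sizes (BidirectionalTokenPredictor, VOCAB_SIZE, and so on) are invented for illustration and are not taken from the cited papers.

```python
import torch
import torch.nn as nn

# Toy setup: a vocabulary of discrete video tokens plus one extra [MASK] id.
# All names and sizes here are illustrative, not from the cited papers.
VOCAB_SIZE, MASK_ID, SEQ_LEN, DIM = 1024, 1024, 64, 256

class BidirectionalTokenPredictor(nn.Module):
    """Encoder without a causal mask: every position attends to all others."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE + 1, DIM)  # +1 for [MASK]
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB_SIZE)

    def forward(self, tokens):                  # tokens: (B, SEQ_LEN) ids
        return self.head(self.encoder(self.embed(tokens)))

model = BidirectionalTokenPredictor()
tokens = torch.randint(0, VOCAB_SIZE, (1, SEQ_LEN))
masked = torch.rand(1, SEQ_LEN) < 0.5           # mask a random subset
inputs = tokens.masked_fill(masked, MASK_ID)
logits = model(inputs)                          # (1, SEQ_LEN, VOCAB_SIZE)
# Unlike autoregressive training, the loss covers all masked positions at once.
loss = nn.functional.cross_entropy(logits[masked], tokens[masked])
```

Because no causal mask is used, a single forward pass scores every masked position simultaneously, which is what makes the faster decoding discussed above possible.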
“…To (partially) address these issues, prior works proposed improved transformers for generative modeling of videos, categorized as follows: (a) employing sparse attention to improve scaling during training [12,16,51], (b) hierarchical approaches that employ separate models at different frame rates to generate long videos with a smaller computation budget [11,16], and (c) removing autoregression by formulating the generative process as masked token prediction and training a bidirectional transformer [12,13]. While each approach is effective at addressing specific limitations of autoregressive transformers, none provides a comprehensive solution to the aforementioned problems: (a) and (b) still inherit the problems of autoregressive inference and, by design, cannot leverage long-term dependencies due to the local attention window, and (c) is ill-suited to learning long-range dependencies due to the quadratic computation cost.…”
Section: Introduction (mentioning)
Confidence: 99%
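For intuition on point (c), the payoff of masked token prediction at inference time is that decoding takes a small, fixed number of parallel refinement rounds rather than one forward pass per token. The sketch below, reusing the hypothetical predictor and MASK_ID from the previous example, follows a MaskGIT-style confidence schedule; it illustrates the general technique, not any cited paper's exact procedure.

```python
import torch

@torch.no_grad()
def parallel_decode(model, seq_len=64, steps=8):
    """MaskGIT-style iterative decoding: a few parallel refinement rounds
    instead of one sequential step per token (illustrative schedule)."""
    tokens = torch.full((1, seq_len), MASK_ID)          # start fully masked
    for step in range(steps):
        probs = model(tokens).softmax(-1)
        conf, pred = probs.max(-1)                      # per-position confidence
        conf = conf.masked_fill(tokens != MASK_ID, 1.0) # keep committed tokens
        keep = int(seq_len * (step + 1) / steps)        # commit more each round
        idx = conf.topk(keep, dim=-1).indices
        fresh = torch.full_like(tokens, MASK_ID)
        fresh.scatter_(1, idx, pred.gather(1, idx))
        tokens = torch.where(tokens == MASK_ID, fresh, tokens)
    return tokens                                       # all positions filled

video_tokens = parallel_decode(model)   # 8 forward passes vs. 64 sequential
```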
“…In most existing GAN-based FAM methods, the target attributes are specified by labels [31], [32], [79] or exemplar images [39], [68], [80]. Recently, conditional information in other modalities, such as text [198]–[202] and speech [203]–[205], has attracted increasing research attention due to the development of large-scale pre-trained frameworks (e.g., CLIP [206]) and the availability of related datasets (e.g., CelebA-Dialog [207]). Moreover, novel modalities of supervision signal, such as biometrics (e.g., brain responses recorded via electroencephalography [208]) and sound [209], have also been utilized to learn feature representations for semantic editing.…”
Section: Challenges and Future Directions (mentioning)
Confidence: 99%
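To make the text-conditioning route concrete: a pre-trained CLIP text encoder maps a prompt to an embedding that an editing model can consume. The sketch below uses the Hugging Face transformers API for CLIP; the downstream injection of `cond` into a FAM model is hypothetical and not part of the cited works.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Load the public CLIP text encoder (assumes the transformers library
# is installed and the weights are downloadable or cached).
name = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_encoder = CLIPTextModel.from_pretrained(name)

with torch.no_grad():
    batch = tokenizer(["a smiling face with blond hair"],
                      padding=True, return_tensors="pt")
    out = text_encoder(**batch)

# pooler_output: one embedding per prompt; a FAM model would inject this
# vector (or the per-token hidden states) as its conditioning signal.
cond = out.pooler_output        # shape: (1, 512)
```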
“…To model the logic of games and the game AI that determine the evolution of the environment states, we introduce an animation model. Specifically, inspired by [Han et al. 2022], we train a non-autoregressive, text-conditioned diffusion model that leverages masked sequence modeling to enable the fine-grained conditioning capabilities on which the above-mentioned applications are based. In particular, we show that using text labels describing actions happening in a game is instrumental in learning such capabilities.…”
Section: Introduction (mentioning)
Confidence: 99%
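A minimal sketch of the masked-sequence-conditioning idea described here, with all sizes and names invented for illustration: a boolean mask marks which frame latents are given as context, and the model only generates the masked remainder, so one network covers prediction, interpolation, and unconditional generation.

```python
import torch

# Toy sizes and names, invented for illustration.
T, D = 16, 32                                 # frames, latent dimension
frames = torch.randn(1, T, D)                 # encoded frame latents
given = torch.zeros(1, T, dtype=torch.bool)
given[:, :4] = True                           # condition on the first 4 frames

mask_token = torch.zeros(D)                   # learned in practice; fixed here
inputs = torch.where(given.unsqueeze(-1), frames, mask_token)
# `inputs`, together with a text embedding of the action label, is what the
# denoising network would see; the training loss applies only where `given`
# is False. Changing `given` switches between prediction, interpolation,
# and unconditional generation without retraining.
```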