2022
DOI: 10.48550/arxiv.2204.03458
Preprint

Video Diffusion Models

Abstract: Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos…

Cited by 74 publications (102 citation statements)
References 26 publications (28 reference statements)
“…One way to implement this would be to train two different models operating at the two different temporal resolutions. In the language of Ho et al. [15], who use a similar approach, sampling would be carried out in the first three stages by a "frameskip-2" model and, in the remaining stages, by a "frameskip-1" model. Both this approach and the autoregressive approach are examples of what we call sampling schemes.…”
Section: Sampling Long Videos
confidence: 99%
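The two-temporal-resolution scheme quoted above can be sketched as a coarse-to-fine sampling procedure: a "frameskip-2" pass produces every other frame, and a "frameskip-1" pass fills in the frames between them. The function names and the stand-in "models" below are hypothetical placeholders for trained diffusion samplers; only the ordering of the two passes reflects the scheme described in the citation.

```python
import numpy as np

def sample_frameskip2(num_keyframes, frame_shape, rng):
    """Stand-in for a 'frameskip-2' diffusion model: generates the video
    at coarse temporal resolution (every other frame)."""
    return rng.standard_normal((num_keyframes, *frame_shape))

def sample_frameskip1(keyframes, rng):
    """Stand-in for a 'frameskip-1' model: fills in the missing frames
    conditioned on neighbouring keyframes (here, a noisy average)."""
    filled = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        filled.append(a)
        filled.append(0.5 * (a + b) + 0.1 * rng.standard_normal(a.shape))
    filled.append(keyframes[-1])
    return np.stack(filled)

rng = np.random.default_rng(0)
keyframes = sample_frameskip2(8, (16, 16, 3), rng)  # coarse pass
video = sample_frameskip1(keyframes, rng)           # fine pass
# 8 keyframes plus 7 in-between frames = 15 frames total
print(video.shape)  # (15, 16, 16, 3)
```

In a real system, each stand-in would itself run a full diffusion reverse process, with the frameskip-1 model conditioned on the keyframes; the sketch only illustrates how the two temporal resolutions compose into one sampling scheme.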
“…Although related work has demonstrated modeling of short photorealistic videos (e.g. 30 frames [34], 48 frames [6] or 64 frames [15]), generating longer videos that are both coherent and photo-realistic remains an open challenge. A major difficulty is scaling: photorealistic image generative models [4,8] are already close to the memory and processing limits of modern hardware.…”
Section: Introduction
confidence: 99%
“…Diffusion Models for Image Synthesis. Starting with the seminal works of Sohl-Dickstein et al. [52] and Ho et al. [21], diffusion-based generative models have improved generative modeling of artificial visual systems [11,31,61,23,64,46] and other data [32,24,62] by sequentially removing noise from a random signal to generate an image. Being likelihood-based models, they achieve high data-distribution coverage with well-behaved optimization properties while producing high-resolution images at unprecedented quality.…”
Section: Related Work
confidence: 99%
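The "sequentially removing noise from a random signal" loop described in the citation above can be sketched as a minimal DDPM-style reverse process. The denoiser here is a toy stand-in (a real model would be a trained noise-prediction network), and the schedule values are illustrative assumptions, not those of any cited paper.

```python
import numpy as np

def toy_denoiser(x, t):
    """Stand-in for a learned noise-prediction network eps_theta(x, t)."""
    return 0.1 * x

def ddpm_sample(shape, num_steps=50, seed=0):
    """Minimal DDPM-style reverse process: start from Gaussian noise and
    sequentially remove the predicted noise over num_steps iterations."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)   # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)               # pure noise at t = T
    for t in reversed(range(num_steps)):
        eps = toy_denoiser(x, t)                 # predicted noise at step t
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                                # add noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

sample = ddpm_sample((16, 16, 3))
print(sample.shape)  # (16, 16, 3)
```

The same loop extends to video by sampling a tensor of shape (frames, height, width, channels) instead of a single image, which is the setting the cited video diffusion work operates in.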