2022
DOI: 10.48550/arxiv.2204.03458
Preprint

Video Diffusion Models

Abstract: Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos…

Cited by 74 publications (102 citation statements)
References 26 publications (28 reference statements)
“…One way to implement this would be to train two different models operating at the two different temporal resolutions. In the language of Ho et al. [15], who use a similar approach, sampling would be carried out in the first three stages by a "frameskip-2" model and, in the remaining stages, by a "frameskip-1" model. Both this approach and the autoregressive approach are examples of what we call sampling schemes.…”
Section: Sampling Long Videos
confidence: 99%
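The two-temporal-resolution scheme quoted above can be sketched as a coarse-to-fine sampling procedure: a "frameskip-2" pass produces every other frame, and a "frameskip-1" pass fills in the frames between them. The function names and the stand-in "models" below are hypothetical placeholders for trained diffusion samplers; only the ordering of the two passes reflects the scheme described in the citation.

```python
import numpy as np

def sample_frameskip2(num_keyframes, frame_shape, rng):
    """Stand-in for a 'frameskip-2' diffusion model: generates the video
    at coarse temporal resolution (every other frame)."""
    return rng.standard_normal((num_keyframes, *frame_shape))

def sample_frameskip1(keyframes, rng):
    """Stand-in for a 'frameskip-1' model: fills in the missing frames
    conditioned on neighbouring keyframes (here, a noisy average)."""
    filled = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        filled.append(a)
        filled.append(0.5 * (a + b) + 0.1 * rng.standard_normal(a.shape))
    filled.append(keyframes[-1])
    return np.stack(filled)

rng = np.random.default_rng(0)
keyframes = sample_frameskip2(8, (16, 16, 3), rng)  # coarse pass
video = sample_frameskip1(keyframes, rng)           # fine pass
# 8 keyframes plus 7 in-between frames = 15 frames total
print(video.shape)  # (15, 16, 16, 3)
```

In a real system, each stand-in would itself run a full diffusion reverse process, with the frameskip-1 model conditioned on the keyframes; the sketch only illustrates how the two temporal resolutions compose into one sampling scheme.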
“…Although related work has demonstrated modeling of short photorealistic videos (e.g. 30 frames [34], 48 frames [6] or 64 frames [15]), generating longer videos that are both coherent and photo-realistic remains an open challenge. A major difficulty is scaling: photorealistic image generative models [4,8] are already close to the memory and processing limits of modern hardware.…”
Section: Introduction
confidence: 99%
“…Diffusion Models for Image Synthesis. Starting with the seminal works of Sohl-Dickstein et al. [52] and Ho et al. [21], diffusion-based generative models have improved generative modeling of artificial visual systems [11,31,61,23,64,46] and other data [32,24,62] by sequentially removing noise from a random signal to generate an image. Being likelihood-based models, they achieve high data-distribution coverage with well-behaved optimization properties while producing high-resolution images at unprecedented quality.…”
Section: Related Work
confidence: 99%
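The "sequentially removing noise from a random signal" loop described in the citation above can be sketched as a minimal DDPM-style reverse process. The denoiser here is a toy stand-in (a real model would be a trained noise-prediction network), and the schedule values are illustrative assumptions, not those of any cited paper.

```python
import numpy as np

def toy_denoiser(x, t):
    """Stand-in for a learned noise-prediction network eps_theta(x, t)."""
    return 0.1 * x

def ddpm_sample(shape, num_steps=50, seed=0):
    """Minimal DDPM-style reverse process: start from Gaussian noise and
    sequentially remove the predicted noise over num_steps iterations."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)   # illustrative noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)               # pure noise at t = T
    for t in reversed(range(num_steps)):
        eps = toy_denoiser(x, t)                 # predicted noise at step t
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                                # add noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

sample = ddpm_sample((16, 16, 3))
print(sample.shape)  # (16, 16, 3)
```

The same loop extends to video by sampling a tensor of shape (frames, height, width, channels) instead of a single image, which is the setting the cited video diffusion work operates in.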