2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01036
Diffusion Autoencoders: Toward a Meaningful and Decodable Representation

Cited by 130 publications (113 citation statements) · References 9 publications
“…In this setting, we shift and scale the hidden states of the UNet with information not only from the time encoding but from the audio embedding as well. We found this approach works better than other conditioning methods, such as using just an additional scale on top of Equation (14) [25], or applying a multi-head attention mechanism with queries that are a function of the audio embedding [30].…”
Section: Speech Conditioning — mentioning, confidence: 92%
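The scale-and-shift conditioning described in this excerpt can be sketched in a few lines. This is a minimal, hypothetical NumPy illustration, not the citing paper's implementation: the projection weights `W`, `b` and the single-example shapes are assumptions chosen for clarity, and the learned networks are reduced to one linear map.

```python
import numpy as np

def scale_shift_condition(h, t_emb, a_emb, W, b):
    """Hypothetical sketch: derive a per-channel scale and shift from the
    concatenated time and audio embeddings, then apply them to the UNet
    hidden states h (adaptive-normalization-style conditioning).

    h:     (C, T)        hidden states for one example
    t_emb: (Dt,)         time-step embedding
    a_emb: (Da,)         audio embedding
    W:     (Dt + Da, 2C) assumed learned projection weights
    b:     (2C,)         assumed learned projection bias
    """
    cond = np.concatenate([t_emb, a_emb])   # joint conditioning vector
    params = cond @ W + b                   # (2C,): scale and shift stacked
    C = h.shape[0]
    scale, shift = params[:C], params[C:]
    # Broadcast the per-channel scale/shift over the time axis; the
    # (1 + scale) form keeps zero-initialized projections at identity.
    return h * (1.0 + scale[:, None]) + shift[:, None]
```

With a zero-initialized projection the operation is the identity on `h`, which is a common initialization choice for this kind of conditioning.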
“…Diffusion autoencoders were first introduced by Preechakul et al (2022), as a way to condition the diffusion process on a compressed latent vector of the input itself. Diffusion can act as a more powerful generative decoder, and hence the input can be reduced to latents with higher compression ratios.…”
Section: Diffusion Magnitude-Autoencoding (DMAE) — mentioning, confidence: 99%
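The decoding side this excerpt refers to can be sketched as a deterministic DDIM-style reverse process where every denoising step is conditioned on the compressed latent `z`. This is an illustrative sketch under stated assumptions, not the paper's code: `eps_model` is an assumed callable standing in for the trained conditional noise predictor, and the schedule values are toy numbers.

```python
import numpy as np

def ddim_decode(z, x_T, eps_model, alphas_bar):
    """Deterministic (eta = 0) DDIM decoding conditioned on the latent z.

    z:          (d,)   compressed latent from the (assumed) semantic encoder
    x_T:        (D,)   starting noise
    eps_model:  assumed callable eps_model(x_t, t, z) -> predicted noise
    alphas_bar: (T+1,) cumulative noise schedule with alphas_bar[0] = 1
    """
    x = x_T
    T = len(alphas_bar) - 1
    for t in range(T, 0, -1):
        eps = eps_model(x, t, z)
        a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]
        # Predict x0 from x_t and the z-conditioned noise estimate,
        # then step deterministically to t-1.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x
```

Because `z` enters every step, the diffusion model acts as the decoder of an autoencoder, which is what allows the encoder side to use higher compression ratios.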
“…Denoising diffusion models [60,64] have seen great success on a wide variety of challenges, ranging from image-to-image translation tasks such as inpainting, colorisation, image upscaling, and uncropping [6,26,41,42,50,53,57,59], to audio generation [11,28,33,35,38,48,67,80], text-based image generation [4,21,23,46,51,55,58], video generation [24,27,82,86], and many others. For a thorough review of diffusion models and their recent applications, we recommend [81].…”
Section: Diffusion Models — mentioning, confidence: 99%