2023
DOI: 10.48550/arxiv.2301.03396
Preprint

Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation

Abstract (Figure 1 caption): Given a single identity frame and an audio clip containing speech, the model uses a diffusion model to sample consecutive frames autoregressively, preserving the identity and modeling lip and head movement to match the audio input. Unlike other methods, no additional guidance is required.
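The autoregressive sampling loop described in the caption can be sketched as follows. This is a toy illustration under stated assumptions only: `denoise_step` is a placeholder standing in for the paper's learned denoising network, the frames are tiny arrays, and all names are hypothetical, not taken from the authors' code.

```python
import numpy as np

FRAME_SHAPE = (8, 8)   # tiny stand-in for an image frame
T = 10                 # diffusion steps (real models use hundreds)

def denoise_step(x_t, t, identity, prev_frame, audio_chunk):
    """Placeholder for the learned denoiser: nudges the noisy frame
    toward a blend of the identity frame, the previous frame, and
    a scalar summary of the audio window."""
    target = 0.7 * identity + 0.3 * prev_frame + 0.01 * audio_chunk.mean()
    return x_t + (target - x_t) / (t + 1)

def sample_frame(identity, prev_frame, audio_chunk, rng):
    """Reverse diffusion: start from Gaussian noise and denoise for T
    steps, conditioned on identity, previous frame, and audio."""
    x = rng.standard_normal(FRAME_SHAPE)
    for t in reversed(range(T)):
        x = denoise_step(x, t, identity, prev_frame, audio_chunk)
    return x

def generate_video(identity, audio, n_frames, rng):
    """Sample frames one after another; each new frame is conditioned
    on the previous one, which is what makes the process autoregressive."""
    frames, prev = [], identity  # first frame conditioned on identity itself
    for i in range(n_frames):
        prev = sample_frame(identity, prev, audio[i], rng)
        frames.append(prev)
    return frames

rng = np.random.default_rng(0)
identity = rng.standard_normal(FRAME_SHAPE)
audio = rng.standard_normal((5, 16))    # 5 frames of audio features
video = generate_video(identity, audio, 5, rng)
print(len(video), video[0].shape)
```

The key design point the caption highlights survives even in this sketch: each sampled frame feeds back in as conditioning for the next, so identity and motion continuity are maintained without any extra guidance signal.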

Cited by 3 publications (3 citation statements)
References 33 publications
“…Related work on data-driven facial animation can be divided into two main categories, vision-based and speech-based; our focus is on the latter. There has been extensive work on neural rendering of talking-head animations in 2D pixel space [Guo et al. 2021; Lu et al. 2021; Stypulkowski et al. 2023; Wu et al. 2021]. However, because rendered videos are not useful in 3D interactive applications, this research addresses speech-driven facial animation in 3D space.…”
Section: Background and Related Work
confidence: 99%
“…Additionally, several methods [33], [34] utilize diffusion models [35] to produce synthetic images. Diffused Heads [36] proposes an autoregressive diffusion model for talking-face generation, which takes an image and an audio sequence as input and produces realistic head movements and facial expressions while preserving the background. DiffTalk [34] employs reference facial images and landmarks to enable personality-aware synthesis, producing high-resolution, audio-driven talking-head videos for previously unseen identities without fine-tuning.…”
Section: Talking Head Synthesis
confidence: 99%
“…Additionally, DDPM adopts a progressive generation approach, allowing it to better maintain the global consistency and structure of images. Existing studies have confirmed that DDPMs exhibit stronger generative capabilities than GANs [26], [27]. Therefore, applying DDPM to face image restoration tasks holds great potential and promising prospects [28]–[31].…”
Section: Introduction
confidence: 99%