2022
DOI: 10.1007/978-3-031-19790-1_6

StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN

Cited by 49 publications (43 citation statements)
References 34 publications
“…Only a few works target the disentanglement of pose and expression for talking face generation. Almost all of them [6,24,37] are based on 3DMMs that explicitly decouple pose and expression. PIRenderer [24] extracts the 3DMM parameters for a driving face through a pre-trained model and then predicts the flow given a source face and the 3DMM parameters.…”
Section: Decoupling (mentioning; confidence: 99%)
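To make the pipeline described in this statement concrete, below is a minimal sketch of the PIRenderer-style idea: 3DMM parameters extracted from the driving face condition a small network that predicts a dense flow field, which then warps the source image. All module names, the parameter dimension, and the network shapes are assumptions for illustration, not the paper's actual code.

```python
# Minimal sketch: driving 3DMM parameters + source image -> dense flow -> warp.
# PARAM_DIM and the network architecture are hypothetical choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

PARAM_DIM = 73  # assumed size of the 3DMM parameter vector

class FlowPredictor(nn.Module):
    """Maps a source image plus driving 3DMM parameters to a dense flow field."""
    def __init__(self, param_dim=PARAM_DIM):
        super().__init__()
        self.encode = nn.Conv2d(3 + param_dim, 32, 3, padding=1)
        self.to_flow = nn.Conv2d(32, 2, 3, padding=1)  # 2 channels: (dx, dy)

    def forward(self, source, params):
        b, _, h, w = source.shape
        # Broadcast the parameter vector over the spatial grid and concatenate.
        p = params.view(b, -1, 1, 1).expand(b, params.shape[1], h, w)
        feat = F.relu(self.encode(torch.cat([source, p], dim=1)))
        return self.to_flow(feat)

def warp(source, flow):
    """Warp the source image with the predicted flow via grid_sample."""
    b, _, h, w = source.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, h, w, 2)
    offset = flow.permute(0, 2, 3, 1)  # flow offsets in normalized coordinates
    return F.grid_sample(source, grid + offset, align_corners=True)

source = torch.rand(1, 3, 64, 64)          # source face image
driving_params = torch.rand(1, PARAM_DIM)  # 3DMM parameters of the driving face
flow = FlowPredictor()(source, driving_params)
reenacted = warp(source, flow)             # coarsely warped result
```

In the actual method the flow warps intermediate feature maps rather than raw pixels and a refinement network follows, but the conditioning structure is the same.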
“…During inference, it can transfer only the expression from the driving face by replacing the expression parameters of the source face with those of the driving one. StyleHEAT [37] follows a similar approach based on a pre-trained StyleGAN. However, the performance of these methods heavily depends on the accuracy of the 3DMMs.…”
Section: Decoupling (mentioning; confidence: 99%)
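The expression-only transfer mentioned here is simple because the 3DMM explicitly separates pose and expression: it amounts to splicing coefficient vectors. A minimal sketch follows; the 64/9 split between expression and pose coefficients is an assumed layout, not the exact parameterization used by PIRenderer or StyleHEAT.

```python
# Minimal sketch: transfer only the expression by overwriting the expression
# block of the source's 3DMM parameters. EXP_DIM/POSE_DIM are assumed sizes.
import numpy as np

EXP_DIM = 64   # assumed number of expression coefficients
POSE_DIM = 9   # assumed number of pose (rotation + translation) coefficients

def transfer_expression(source_params, driving_params):
    """Keep the source pose, take the driving expression."""
    mixed = source_params.copy()
    mixed[:EXP_DIM] = driving_params[:EXP_DIM]  # overwrite expression block only
    return mixed

source_params = np.random.randn(EXP_DIM + POSE_DIM)
driving_params = np.random.randn(EXP_DIM + POSE_DIM)
mixed = transfer_expression(source_params, driving_params)
assert np.allclose(mixed[EXP_DIM:], source_params[EXP_DIM:])  # pose untouched
```

This also makes the cited limitation visible: if the 3DMM estimator leaks pose into the expression coefficients, the splice transfers unwanted head motion along with the expression.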
“…ATVG [Chen et al. 2019] and MakeItTalk [Zhou et al. 2020] first generate facial landmarks from audio and then render the video using a landmark-to-video network. Dense flow fields are another active research direction [Siarohin et al. 2019; Yin et al. 2022]. [Zhang et al. 2021a] predict the 3DMM coefficients from audio and then transfer these parameters into a flow-based warping network.…”
Section: Audio-based Single Image Facial Animation (mentioning; confidence: 99%)
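The audio-driven variant cited last is a two-stage pipeline: an audio encoder predicts per-frame 3DMM coefficients, and those coefficients drive a flow-based warping network (the FlowPredictor/warp sketch above could serve as that second stage). Below is a minimal sketch of the first stage; the mel-window shape, layer sizes, and module name are assumptions, not the cited paper's architecture.

```python
# Minimal sketch: audio features -> 3DMM coefficients, the first stage of the
# audio-driven pipeline. Shapes and the network design are hypothetical.
import torch
import torch.nn as nn

class AudioTo3DMM(nn.Module):
    """Maps a window of mel-spectrogram frames to 3DMM coefficients."""
    def __init__(self, n_mels=80, window=16, param_dim=73):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                    # (B, n_mels * window)
            nn.Linear(n_mels * window, 256),
            nn.ReLU(),
            nn.Linear(256, param_dim),       # predicted 3DMM coefficients
        )

    def forward(self, mel_window):
        return self.net(mel_window)

mel = torch.rand(1, 80, 16)      # one audio window (mel bins x frames)
params = AudioTo3DMM()(mel)      # (1, 73) coefficients for the warping stage
```

Running this per audio window yields a coefficient sequence; feeding each frame's coefficients to the flow-based warping stage animates the single source image.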