Depth-Aware Generative Adversarial Network for Talking Head Video Generation

Hong, Fa-Ting; Zhang, Longhao; Shen, Li; Xu, Dan

doi:10.1109/cvpr52688.2022.00339

Cited by 66 publications

(36 citation statements)

References 105 publications

“…Talking head generation works can be broadly classified in three categories based on the type of input they use to generate a talking head: Text-driven [16,33,36], Audio-driven [9,13,18,31,37,43,45], and Video-driven [12,27,29,39,44] Talking Head Generation.…”

Section: Related Work (mentioning)

confidence: 99%

“…The motion field was used to calculate dense flow and warp the source frame in a latent space. Several other works [39,12] followed the same principle and added supplementary components to improve the quality. Face-vid2vid [39] used keypoint information in a 3D space, taking care of head rotation, among other things.…”

Section: Related Work (mentioning)

confidence: 99%

“…Face-vid2vid [39] used keypoint information in a 3D space, taking care of head rotation, among other things. DA-GAN [12] further added depth-aware attention to provide dense 3D facial geometry to guide the generation of motion fields. A similar approach in Motion-Representation-in-Articulated-Animation [30] uses key regions instead of keypoints to generate the warpable motion field.…”

Section: Related Work (mentioning)

confidence: 99%

“…Multiple publications have improved the quality of the generations. Existing works on talking head generation generally use a single modality, i.e., either visual [12,29,39,40] or audio features [13,37,31]. Audio-driven talking head generation models are good at generating quality lipsync; however, they have a serious drawback in handling non-verbal cues.…”

Section: Introduction (mentioning)

confidence: 99%

“…The video-driven methods heavily rely on the disentanglement of motion from the appearance [17]. These methods generally use key points as an intermediate representation [29,12,39] and try to align the detected key points of source and driving frames. These works learn key points in an unsupervised manner and fail to focus on specific regions of the face.…”

Section: Introduction (mentioning)

confidence: 99%

Audio-Visual Face Reenactment

Agarwal, Mukhopadhyay, Namboodiri et al. 2022

Preprint

Figure 1: We propose AVFR-GAN, a novel method for face reenactment. Our network takes a source identity, a driving frame, and a small audio chunk associated with the driving frame to animate the source identity according to the driving frame. Our network generates highly realistic outputs compared to previous works like [29] and [30]. Results from our network contain significantly fewer artifacts and handle things like mouth movements, eye movements, etc. in a better manner.

Wav2Lip‐HR: Synthesising clear high‐resolution talking head in the wild

Liang, Wang, Chen et al. 2023

Computer Animation & Virtual

Talking head generation aims to synthesize a photo‐realistic speaking video with accurate lip motion. While this field has attracted growing attention in recent audio‐visual research, most existing methods do not improve lip synchronization and visual quality simultaneously. In this paper, we propose Wav2Lip‐HR, a neural audio‐driven method for high‐resolution talking head generation. With our technique, all that is required to generate a clear high‐resolution lip‐synced talking video is an image or video of the target face and an audio clip of any speech. The primary benefit of our method is that it generates clear high‐resolution videos with sufficient facial detail, rather than videos that are merely large but lack clarity. We first analyze the key factors that limit the clarity of generated videos and then put forth several solutions to address the problem, including data augmentation, model structure improvements, and a more effective loss function. Finally, we employ several efficient metrics to evaluate the clarity of the images generated by our approach, as well as several widely used metrics to evaluate lip‐sync performance. Numerous experiments demonstrate that our method outperforms existing schemes in both visual quality and lip synchronization.
