Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413532
A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild

Abstract: Figure 1: Our novel Wav2Lip model produces significantly more accurate lip synchronization in dynamic, unconstrained talking-face videos. Quantitative metrics indicate that the lip sync in our generated videos is almost as good as in real synced videos. Thus, we believe that our model can enable a wide range of real-world applications where previous speaker-independent lip-syncing approaches [17, 18] struggle to produce satisfactory results.

Cited by 440 publications (391 citation statements)
References 22 publications (37 reference statements)
“…It can be seen that our method achieves the best results under most of the metrics on both datasets. On LRW, though Wav2Lip [44] outperforms our method on two metrics, the reason is…”
Section: Quantitative Evaluation (mentioning)
confidence: 87%
“…Learning Speech Content Space. It has been verified that learning the natural synchronization between visual mouth movements and auditory utterances is valuable for driving images to speak [72, 44]. We thus use an embedding space that contains synchronized audio-visual features as the speech content space.…”
Section: Modularization Of Representations (mentioning)
confidence: 99%
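The quoted passage points at the core training signal Wav2Lip itself builds on: two encoders map a lip-region video window and the corresponding audio window into a shared embedding space, and synchronized pairs are scored higher than off-sync ones. Below is a minimal PyTorch sketch of that SyncNet-style scoring and loss; `video_emb` and `audio_emb` are hypothetical encoder outputs, and the clamp is added here only to keep the binary cross-entropy numerically well-defined, so treat this as an illustration rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F

def sync_probability(video_emb: torch.Tensor,
                     audio_emb: torch.Tensor,
                     eps: float = 1e-7) -> torch.Tensor:
    # Cosine similarity between lip-video and audio embeddings,
    # clamped into (0, 1) so it can be read as a sync probability.
    cos = F.cosine_similarity(video_emb, audio_emb, dim=1)
    return cos.clamp(eps, 1.0 - eps)

def sync_loss(video_emb: torch.Tensor,
              audio_emb: torch.Tensor,
              is_synced: torch.Tensor) -> torch.Tensor:
    # BCE pushes in-sync pairs toward 1 and off-sync pairs toward 0,
    # shaping the shared audio-visual ("speech content") space.
    p = sync_probability(video_emb, audio_emb)
    return F.binary_cross_entropy(p, is_synced)
```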
“…Another category of lip synchronisation methods is non-constrained methods. We have studied two non-constrained methods here: LipGAN [24] and Wav2Lip [5]. Wav2Lip is speaker-independent.…”
Section: Discussion (mentioning)
confidence: 99%
“…The model works quite well for videos that are not very dynamic. Upon investigating [5], the reason was found to be an inadequate discriminator loss function.…”
Section: Unconstrained Talking Face Generation From Speech (mentioning)
confidence: 99%
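This observation matches Wav2Lip's own diagnosis [5]: a sync discriminator trained jointly with the generator becomes progressively easier to fool on dynamic videos. Their remedy is a pre-trained "expert" sync discriminator that stays frozen while only the generator is updated, with its loss mixed into reconstruction and adversarial terms. A sketch of that weighted objective follows, assuming the weights s_w = 0.03 and s_g = 0.07 reported in the paper; the function name and arguments are illustrative, not the authors' API.

```python
def wav2lip_generator_loss(recon_l1, expert_sync_loss, adv_loss,
                           s_w: float = 0.03, s_g: float = 0.07):
    # Total generator loss, in the spirit of Wav2Lip [5]:
    #   L = (1 - s_w - s_g) * L_recon + s_w * E_sync + s_g * L_gen
    # The expert sync discriminator is frozen, so expert_sync_loss
    # cannot be weakened by adversarial training; only the generator
    # must adapt to satisfy it. Weight values are the paper's
    # reported settings (an assumption here).
    return ((1.0 - s_w - s_g) * recon_l1
            + s_w * expert_sync_loss
            + s_g * adv_loss)
```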