2020
DOI: 10.48550/arxiv.2008.03592
Preprint
Speech Driven Talking Face Generation from a Single Image and an Emotion Condition

Abstract: Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input to render a talking face video in sync with the speech and expressing the conditioned emotion. Objective evaluation on image quality, audiovisual…

Cited by 3 publications (1 citation statement)
References 59 publications (61 reference statements)
“…Another recent end-to-end system for talking face generation from noisy speech has been studied with respect to image quality and mouth-shape synchronization, which is attained by a mouth region mask (MRM) loss [8]. In their follow-up work [9], an end-to-end talking face generation system receives a reference face image, a speech utterance, and a categorical emotion label to generate a talking face video in sync with the speech and expressing the conditioned emotion. They discard the synchronization discriminator from their previous work and keep only the MRM loss for the mouth movements.…”
Section: Related Work
confidence: 99%