2020
DOI: 10.48550/arxiv.2008.03592
Preprint
Speech Driven Talking Face Generation from a Single Image and an Emotion Condition

Abstract: Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input to render a talking face video in sync with the speech and expressing the conditioned emotion. Objective evaluation on image quality, audiovisual…

Cited by 3 publications (1 citation statement)
References 59 publications (61 reference statements)
“…Another recent end-to-end system for talking face generation from noisy speech has been studied with respect to image quality and mouth-shape synchronization, which is attained by a mouth region mask (MRM) loss [8]. In their follow-up work [9], an end-to-end talking face generation system receives a reference face image, a speech utterance, and a categorical emotion label to generate a talking face video in sync with the speech and expressing the conditioned emotion. They discard the synchronization discriminator from their previous work and keep only the MRM loss for the mouth movements.…”
Section: Related Work
confidence: 99%