End-To-End Generation of Talking Faces from Noisy Speech

Eskimez, Şefik Emre; Maddox, Ross K.; Xu, Chenliang; Duan, Zhiyao

doi:10.1109/icassp40776.2020.9054103

Cited by 22 publications

(18 citation statements)

References 12 publications

“…Therefore, a natural progression of this work will be to perform on-line experiments with noise-hardened versions of the synthesizer, such as that proposed by Eskimez et al. (2020). Further studies will also look at improving the synthesizer model through the implementation of targeted loss models, informed by the findings of the confusion matrix analysis presented here.…”

Section: Discussion (mentioning)

confidence: 99%

Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans

Varano, Vougioukas, et al. 2022, Front. Neurosci.

Understanding speech becomes a demanding task when the environment is noisy. Comprehension of speech in noise can be substantially improved by looking at the speaker’s face, and this audiovisual benefit is even more pronounced in people with hearing impairment. Recent advances in AI have made it possible to synthesize photorealistic talking faces from a speech recording and a still image of a person’s face in an end-to-end manner. However, it has remained unknown whether such facial animations improve speech-in-noise comprehension. Here we consider facial animations produced by a recently introduced generative adversarial network (GAN), and show that humans cannot distinguish between the synthesized and the natural videos. Importantly, we then show that the end-to-end synthesized videos significantly aid humans in understanding speech in noise, although the natural facial motions yield an even higher audiovisual benefit. We further find that an audiovisual speech recognizer (AVSR) benefits from the synthesized facial animations as well. Our results suggest that synthesizing facial motions from speech can be used to aid speech comprehension in difficult listening environments.

“…Figure 1 shows the system overview, which employs the generative adversarial network (GAN) framework. Our generator network architecture is built based on our previous work [21], with a modification to accept the emotion condition input. For discriminator networks, we use one discriminator to distinguish the emotions expressed in videos, and another discriminator to distinguish the real and generated video frames.…”

Section: Methods (mentioning)

confidence: 99%

“…They further improved their methods with three discriminators [10] that focus on improving the realness of video frames, the continuity between generated frames, and the synchronization between audio and visual data. Eskimez et al [21] proposed an end-to-end talking face generation system that is robust to noisy speech input. The system contains a frame discriminator to improve image quality and a pair discriminator to improve lip-speech synchronization.…”

Section: A Emotional Talking Face Generation (mentioning)

confidence: 99%

“…1) Speech Encoder: The speech encoder processes the input speech waveform and outputs a speech embedding. It follows the original implementation of [21]…”

Section: A Generator (mentioning)

confidence: 99%

“…2) Image Encoder: The image encoder computes an image embedding from the input condition face image. The architecture follows the original implementation without any modification [21]. It contains six layers of 2-D convolutional layers with the following number of filters, kernel sizes, and down-sampling factors: (64, 3, 2), (128, 3, 2), (256, 3, 2), (512, 3, 2), (512, 3, 2), (512, 4, 1), respectively.…”

Section: Wgan-gp (mentioning)

confidence: 99%
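
The layer list quoted in the image encoder statement above can be traced numerically. The following is a minimal sketch, not the authors' implementation: the 128x128 RGB input size, the "same"-style padding on the down-sampling layers, and the unpadded final layer are all assumptions, since the quoted statement gives only filter counts, kernel sizes, and down-sampling factors. The function name `trace_shapes` is hypothetical.

```python
import math

# (filters, kernel_size, downsampling_factor), as listed in the quoted statement.
IMAGE_ENCODER_LAYERS = [
    (64, 3, 2), (128, 3, 2), (256, 3, 2),
    (512, 3, 2), (512, 3, 2), (512, 4, 1),
]


def trace_shapes(height, width, in_channels=3, layers=IMAGE_ENCODER_LAYERS):
    """Return the (channels, height, width) shape after each layer.

    Padding is NOT given in the source; the rules below are assumptions.
    """
    shapes = [(in_channels, height, width)]
    for filters, kernel, stride in layers:
        if stride > 1:
            # Assumption: "same"-style padding, so size = ceil(size / stride).
            height = math.ceil(height / stride)
            width = math.ceil(width / stride)
        else:
            # Assumption: no padding on the final, non-down-sampling layer.
            height = height - kernel + 1
            width = width - kernel + 1
        if height < 1 or width < 1:
            raise ValueError("input is too small for this layer stack")
        shapes.append((filters, height, width))
    return shapes


if __name__ == "__main__":
    # Assumption: a 128x128 RGB condition image.
    for shape in trace_shapes(128, 128):
        print(shape)
```

Under these assumptions, the five stride-2 layers halve the spatial size each time and the final 4x4 layer collapses the remaining map into a single 512-channel vector, which is consistent with the statement's description of the output as an image embedding.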

Speech Driven Talking Face Generation from a Single Image and an Emotion Condition

Eskimez, Zhang, Duan, 2020, Preprint

Self Cite

Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input to render a talking face video in sync with the speech and expressing the condition emotion. Objective evaluation on image quality, audiovisual synchronization, and visual emotion expression shows that the proposed system outperforms a state-of-the-art baseline system. Subjective evaluation of visual emotion expression and video realness also demonstrates the superiority of the proposed system. Furthermore, we conduct a pilot study on human emotion recognition of generated videos with mismatched emotions between the audio and visual modalities, and results show that humans rely on the visual modality more significantly than the audio modality on this task.