Generating Talking Face Landmarks from Speech

Eskimez, Şefik Emre; Maddox, Ross K.; Xu, Chenliang; Duan, Zhiyao

doi:10.1007/978-3-319-93764-9_35

Cited by 43 publications

(15 citation statements)

References 20 publications

(23 reference statements)

“…Besides leveraging intermediate landmarks for avoiding directly correlating speech audio with irrelevant visual dynamics, we also propose a novel dynamically adjustable loss along with an attention mechanism to enforce the network to focus on audiovisual-correlated regions. It is worth to mention that in a recent audio-driven facial landmarks generation work [8], such irrelevant visual dynamics are removed in the training process by normalizing and identity-removing the facial landmarks. This has been shown to result in more natural synchronization between generated mouth shapes and speech audio.…”

Section: Introduction (mentioning)

confidence: 99%

Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss

Chen, Maddox, Duan, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. We, humans, are sensitive to temporal discontinuities and subtle artifacts in video. To avoid those pixel jittering problems and to enforce the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate a sharper image with well-synchronized facial movements, we propose a novel regression-based discriminator structure, which considers sequence-level information along with frame-level information. Thoughtful experiments on several datasets and real-world samples demonstrate significantly better results obtained by our method than the state-of-the-art methods in both quantitative and qualitative comparisons.

“…Suwajanakorn et al [46] proposed an interesting technique to automatically edit a video of a given speaker with accurate lip synchronization guided by his own audio in a different speech. This work has spawned in recent years a number of variant methods on the task [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57].…”

Section: Related Work (mentioning)

confidence: 99%

Multi-modality Deep Restoration of Extremely Compressed Face Videos

Zhang, Wu 2021

Preprint

Arguably the most common and salient object in daily video communications is the talking head, as encountered in social media, virtual classrooms, teleconferences, news broadcasting, talk shows, etc. When communication bandwidth is limited by network congestions or cost effectiveness, compression artifacts in talking head videos are inevitable. The resulting video quality degradation is highly visible and objectionable due to high acuity of human visual system to faces. To solve this problem, we develop a multi-modality deep convolutional neural network method for restoring face videos that are aggressively compressed. The main innovation is a new DCNN architecture that incorporates known priors of multiple modalities: the video-synchronized speech signal and semantic elements of the compression code stream, including motion vectors, code partition map and quantization parameters. These priors strongly correlate with the latent video and hence they are able to enhance the capability of deep learning to remove compression artifacts. Ample empirical evidences are presented to validate the superior performance of the proposed DCNN method on face videos over the existing state-of-the-art methods.

“…The presence of visual cues improves speech comprehension [1], [2], [3], [4] in noisy environments and for the hard-of-hearing population. Consequently, researchers developed systems that can automatically generate talking faces from speech in order to provide the visual cues when they are not available [5], [6], [7], [8], [9], [10], [11], [12]. These systems can increase the accessibility of abundantly available audio-only resources for the hearing impaired population.…”

Section: Introduction (mentioning)

confidence: 99%

Speech Driven Talking Face Generation from a Single Image and an Emotion Condition

Eskimez, Zhang, Duan 2020

Preprint

Self Cite

Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input to render a talking face video in sync with the speech and expressing the condition emotion. Objective evaluation on image quality, audiovisual synchronization, and visual emotion expression shows that the proposed system outperforms a state-of-the-art baseline system. Subjective evaluation of visual emotion expression and video realness also demonstrates the superiority of the proposed system. Furthermore, we conduct a pilot study on human emotion recognition of generated videos with mismatched emotions between the audio and visual modalities, and results show that humans rely on the visual modality more significantly than the audio modality on this task.
