2022
DOI: 10.1109/tvcg.2021.3107669

Geometry-Guided Dense Perspective Network for Speech-Driven Facial Animation

Abstract: Realistic speech-driven 3D facial animation is a challenging problem due to the complex relationship between speech and face. In this paper, we propose a deep architecture, called Geometry-guided Dense Perspective Network (GDPnet), to achieve speaker-independent realistic 3D facial animation. The encoder is designed with dense connections to strengthen feature propagation and encourage the re-use of audio features, and the decoder is integrated with an attention mechanism to adaptively recalibrate point-wise f…
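The abstract names two architectural ideas without full detail: dense connections in the audio encoder (feature re-use) and an attention mechanism in the decoder that recalibrates point-wise features. As a rough illustration only, here is a minimal NumPy sketch of both patterns; all shapes, weight names, and the sigmoid-gate form of the recalibration are assumptions, not the paper's actual implementation.

```python
import numpy as np

def dense_block(x, layer_weights):
    """DenseNet-style block: each layer receives the concatenation of the
    input and all preceding layers' outputs, encouraging feature re-use."""
    features = [x]
    for W in layer_weights:
        inp = np.concatenate(features, axis=-1)
        features.append(np.maximum(inp @ W, 0.0))  # linear layer + ReLU
    return np.concatenate(features, axis=-1)

def recalibrate(point_feats, w_gate):
    """Attention-style recalibration (hypothetical form): a learned sigmoid
    gate in (0, 1) scales each point's feature vector adaptively."""
    gate = 1.0 / (1.0 + np.exp(-(point_feats @ w_gate)))  # shape (N, 1)
    return point_feats * gate

# Toy shapes: 4 points, 8 input channels, two dense layers of width 4.
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((4, 8))
weights = [rng.standard_normal((8, 4)),      # sees the 8 input channels
           rng.standard_normal((12, 4))]     # sees 8 + 4 concatenated channels
encoded = dense_block(audio_feats, weights)  # (4, 8 + 4 + 4) = (4, 16)
decoded = recalibrate(encoded, rng.standard_normal((16, 1)))
```

The dense connections grow the channel dimension by each layer's width (8 → 16 here), while the gate leaves the shape unchanged and only rescales each point's features.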

Cited by 17 publications (10 citation statements)
References 41 publications
“…We compared the results with current state-of-the-art methods [13,15,22] on the VOCASET dataset, directly using the data provided in these papers. Evaluation of mouth synchronization.…”
Section: Experimental Results and Analysis
confidence: 99%
“…There are several methods [10][11][12] to obtain 3D facial parameter representations from 2D monocular videos, but the quality of the synthesized 3D data is limited by the accuracy of 3D reconstruction techniques, which cannot recover subtle 3D changes from 2D videos, so the results may be unreliable. Works that generate 3D facial animations based on 3D meshes [13][14][15] condition on short audio windows, which may cause pauses in lip movements as the speech changes and thereby reduce the realism of the facial motion.…”
Section: Introduction
confidence: 99%
“…The critical contribution of VOCA is that the additional identity control parameters can vary the identity-dependent visual dynamics. Based on VOCA, Liu et al [186] proposed a geometry-guided dense perspective network (GDPnet) with two constraints from different perspectives to achieve a more robust generation. Fan et al [187] proposed a Transformer-based autoregressive VSG model named FaceFormer to encode the long-term audio context information and predict a sequence of 3D face vertices.…”
Section: Vertex-Based Methods
confidence: 99%
“…However, these methods are not applicable to 3D character models that are widely used in 3D games and virtual reality interactions. Therefore, speech-driven 3D facial animation has attracted more attention recently [2,15,6,41,12,35,23,7,5].…”
Section: Speech-Driven 3D Facial Animation
confidence: 99%