2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.01034

Capture, Learning, and Synthesis of 3D Speaking Styles

Figure 1: Given an arbitrary speech signal and a static 3D face mesh as input (left), our model, VOCA, outputs a realistic 3D character animation (right). Input: speech signal and 3D template; output: 3D character animation. Top: Winston Churchill. Bottom: actor from Karras et al. [33]. See supplementary video.

Abstract: Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics.
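To make the input/output contract above concrete, here is a minimal Python sketch of a VOCA-style inference pipeline. All function and class names (extract_speech_features, decode_offsets, animate) are hypothetical, not the authors' released API; the stand-in encoder and decoder only mimic the shapes of the real components (VOCA encodes speech with DeepSpeech features and regresses per-vertex displacements over a FLAME-topology template mesh).

```python
# Minimal sketch of a VOCA-style inference loop (hypothetical names throughout).
# Assumption: speech is encoded into per-frame feature vectors, and a learned
# decoder maps each feature (plus a speaking-style code) to per-vertex
# displacements that are added to a static template mesh.
import numpy as np

N_VERTICES = 5023   # FLAME-topology mesh size used by VOCA
FEATURE_DIM = 64    # per-frame speech feature size (assumed)


def extract_speech_features(audio: np.ndarray, sample_rate: int,
                            fps: int = 60) -> np.ndarray:
    """Placeholder for a speech encoder (VOCA uses DeepSpeech features).

    Returns one feature vector per animation frame.
    """
    n_frames = max(1, int(len(audio) / sample_rate * fps))
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_frames, FEATURE_DIM))  # stand-in features


def decode_offsets(features: np.ndarray, style_id: int) -> np.ndarray:
    """Placeholder for the learned decoder: features -> per-vertex offsets.

    A real model conditions on the speaking style (training subject) and
    regresses (n_frames, N_VERTICES, 3) displacements.
    """
    rng = np.random.default_rng(style_id)
    w = rng.standard_normal((FEATURE_DIM, N_VERTICES * 3)) * 1e-4
    return (features @ w).reshape(len(features), N_VERTICES, 3)


def animate(template_vertices: np.ndarray, audio: np.ndarray,
            sample_rate: int, style_id: int = 0) -> np.ndarray:
    """Speech + static template mesh -> sequence of animated meshes."""
    feats = extract_speech_features(audio, sample_rate)
    offsets = decode_offsets(feats, style_id)
    return template_vertices[None] + offsets  # (n_frames, N_VERTICES, 3)


if __name__ == "__main__":
    template = np.zeros((N_VERTICES, 3))   # static 3D face mesh
    audio = np.zeros(16000)                # 1 s of silence at 16 kHz
    frames = animate(template, audio, sample_rate=16000)
    print(frames.shape)                    # (60, 5023, 3)
```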

Cited by 231 publications (250 citation statements). References 63 publications (78 reference statements).
“…New commercial HMDs, such as the Vive Pro Eye, can enable correct rendering of the avatar motions, which may again improve the reported results. Additionally, lip-sync systems for avatar animation keep evolving and they are currently reaching human perception levels [6]. Hence we hypothesise that lip-sync will become an even more common form of facial animation.…”
Section: Discussion
confidence: 98%
“…Some works aim to synthesize coherent dynamic 3D face videos of a fixed identity with the help of 3DMMs. These include works that synthesize 4D videos from a static 3D mesh paired with semantic label information [Bolkart and Wuhrer 2015a], and from a static 3D mesh and audio information [Cudeiro et al 2019].…”
Section: Correspondence
confidence: 99%
“…VisemeNet [50], a three-stage LSTM network, is proposed to achieve real-time audio-lip synchronization and can be seamlessly integrated into existing animation workflows. VOCA [13], which is trained on a unique 4D face dataset, takes any speech signal as input and realistically animates a wide range of adult faces. Methods based on computer graphics require the collection and manipulation of complex head models.…”
Section: Talking Face Generation
confidence: 99%
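As a rough illustration of the recurrent approach the excerpt above attributes to VisemeNet, the sketch below shows a single-stage LSTM regressor from per-frame audio features to animation parameters (e.g. viseme activations). It is a deliberate simplification, not VisemeNet's actual three-stage design; all dimensions and names are illustrative assumptions.

```python
# Illustrative single-stage LSTM regressor from per-frame audio features to
# animation parameters (e.g. viseme weights). This is NOT VisemeNet's actual
# three-stage architecture; dimensions and names are assumptions.
import torch
import torch.nn as nn


class AudioToAnimLSTM(nn.Module):
    def __init__(self, feat_dim: int = 64, hidden: int = 256,
                 n_params: int = 20):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_params)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_frames, feat_dim) -> (batch, n_frames, n_params)
        out, _ = self.lstm(feats)
        return self.head(out)


model = AudioToAnimLSTM()
dummy = torch.randn(1, 100, 64)   # 100 frames of audio features
print(model(dummy).shape)         # torch.Size([1, 100, 20])
```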