Interspeech 2018
DOI: 10.21437/interspeech.2018-2587
Joint Learning of Facial Expression and Head Pose from Speech

Abstract: Natural movement plays a significant role in realistic speech animation, and numerous studies have demonstrated the contribution visual cues make to the degree to which human observers find an animation acceptable. Natural, expressive, emotive, and prosodic speech exhibits motion patterns that are difficult to predict, with considerable variation in visual modalities. Recently, there have been some impressive demonstrations of face animation derived in some way from the speech signal. Each of these methods has taken un…


Cited by 23 publications (29 citation statements); References 31 publications.
“…Attention masks are used to focus on the most changing parts on the face, especially the lips. Greenwood et al [2018] jointly learnt facial expressions and head poses in terms of landmarks from a forked Bi-directional LSTM network. Most previous audio-to-face-animation work focused on matching speech content and left out style/identity information since the identity is usually bypassed due to mode collapse or averaging during training.…”
Section: Related Work
confidence: 99%
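The excerpt above describes the cited architecture: a forked bidirectional LSTM with a shared recurrent trunk over audio features and separate output heads for facial landmarks and head pose. The following is a minimal numpy sketch of that idea only; the feature dimension (13 MFCCs per frame), hidden size, 68-point landmark parameterization, and random weights are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def lstm_forward(x, Wx, Wh, b):
    """Run a single-direction LSTM over x (T, D); return hidden states (T, H)."""
    T, _ = x.shape
    H = Wh.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    out = np.zeros((T, H))
    for t in range(T):
        gates = x[t] @ Wx + h @ Wh + b          # all four gates at once, shape (4H,)
        i, f, g, o = np.split(gates, 4)
        i, f, o = (1 / (1 + np.exp(-z)) for z in (i, f, o))  # sigmoid gates
        c = f * c + i * np.tanh(g)              # cell state update
        h = o * np.tanh(c)
        out[t] = h
    return out

def bilstm(x, params_fwd, params_bwd):
    """Bidirectional pass: concatenate forward and time-reversed backward states."""
    fwd = lstm_forward(x, *params_fwd)
    bwd = lstm_forward(x[::-1], *params_bwd)[::-1]
    return np.concatenate([fwd, bwd], axis=1)   # (T, 2H)

def init(D, H, rng):
    """Random LSTM parameters (Wx, Wh, b) for illustration only."""
    return (rng.standard_normal((D, 4 * H)) * 0.1,
            rng.standard_normal((H, 4 * H)) * 0.1,
            np.zeros(4 * H))

rng = np.random.default_rng(0)
D, H, T = 13, 16, 50                 # assumed: 13 MFCCs/frame, 50-frame utterance
audio = rng.standard_normal((T, D))

shared = bilstm(audio, init(D, H, rng), init(D, H, rng))  # shared BiLSTM trunk
# The "fork": independent linear heads on the shared representation.
W_expr = rng.standard_normal((2 * H, 68 * 2)) * 0.1  # 68 2-D facial landmarks
W_pose = rng.standard_normal((2 * H, 3)) * 0.1       # head pose: pitch, yaw, roll
landmarks = shared @ W_expr          # (T, 136) per-frame landmark predictions
pose = shared @ W_pose               # (T, 3) per-frame head-pose predictions
```

Forking after the shared trunk lets both modalities be trained jointly from one speech representation, which is the joint-learning property the citation statement highlights.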
“…Last but not least, handling lip syncing and facial animation are not sufficient for the perception of realism of talking-heads. The entire facial expression considering the correlation between all facial elements and head pose also play an important role [Faigin 2012;Greenwood et al 2018]. These correlations, however, are less constrained by the audio and thus hard to be estimated.…”
Section: Introduction
confidence: 99%
“…These approaches achieve highly realistic results, but they are typically personalized and are not audio-driven. Most fully speech-driven 3D face animation techniques require either personalized models [5,22,26] or map to lower fidelity blendshape models [24] or facial landmarks [12,16]. Cao et al [5] propose speech-driven animation of a realistic textured personalized 3D face model that requires mocap data from the person to be animated, offline processing and blending of motion snippets.…”
Section: Related Work
confidence: 99%
“…Head animation synthesis focuses on synthesizing head pose from input speech. Some works direct regress head pose with BiLSTM (Ding, Zhu, and Xie 2015;Greenwood, Matthews, and Laycock 2018) or the encoder of transformer (Vaswani et al 2017). More precisely, head pose generation from speech is a one-to-many mapping, Sadoughi and Busso (2018) employ GAN (Goodfellow et al 2014;Mirza and Osindero 2014;Yu et al 2019b,a) to retain the diversity.…”
Section: Facial Animation Synthesis
confidence: 99%