Interspeech 2018
DOI: 10.21437/interspeech.2018-2587
Joint Learning of Facial Expression and Head Pose from Speech

Abstract: Natural movement plays a significant role in realistic speech animation, and numerous studies have demonstrated the contribution visual cues make to the degree to which human observers find an animation acceptable. Natural, expressive, emotive, and prosodic speech exhibits motion patterns that are difficult to predict, with considerable variation in visual modalities. Recently, there have been some impressive demonstrations of face animation derived in some way from the speech signal. Each of these methods has taken un…


Cited by 23 publications (29 citation statements); References 31 publications.
“…Attention masks are used to focus on the most changing parts on the face, especially the lips. Greenwood et al [2018] jointly learnt facial expressions and head poses in terms of landmarks from a forked Bi-directional LSTM network. Most previous audio-to-face-animation work focused on matching speech content and left out style/identity information since the identity is usually bypassed due to mode collapse or averaging during training.…”
Section: Related Work
confidence: 99%
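The excerpt above describes the cited architecture: a forked bidirectional LSTM with a shared recurrent trunk over audio features and separate output heads for facial landmarks and head pose. The following is a minimal numpy sketch of that idea only; the feature dimension (13 MFCCs per frame), hidden size, 68-point landmark parameterization, and random weights are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def lstm_forward(x, Wx, Wh, b):
    """Run a single-direction LSTM over x (T, D); return hidden states (T, H)."""
    T, _ = x.shape
    H = Wh.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    out = np.zeros((T, H))
    for t in range(T):
        gates = x[t] @ Wx + h @ Wh + b          # all four gates at once, shape (4H,)
        i, f, g, o = np.split(gates, 4)
        i, f, o = (1 / (1 + np.exp(-z)) for z in (i, f, o))  # sigmoid gates
        c = f * c + i * np.tanh(g)              # cell state update
        h = o * np.tanh(c)
        out[t] = h
    return out

def bilstm(x, params_fwd, params_bwd):
    """Bidirectional pass: concatenate forward and time-reversed backward states."""
    fwd = lstm_forward(x, *params_fwd)
    bwd = lstm_forward(x[::-1], *params_bwd)[::-1]
    return np.concatenate([fwd, bwd], axis=1)   # (T, 2H)

def init(D, H, rng):
    """Random LSTM parameters (Wx, Wh, b) for illustration only."""
    return (rng.standard_normal((D, 4 * H)) * 0.1,
            rng.standard_normal((H, 4 * H)) * 0.1,
            np.zeros(4 * H))

rng = np.random.default_rng(0)
D, H, T = 13, 16, 50                 # assumed: 13 MFCCs/frame, 50-frame utterance
audio = rng.standard_normal((T, D))

shared = bilstm(audio, init(D, H, rng), init(D, H, rng))  # shared BiLSTM trunk
# The "fork": independent linear heads on the shared representation.
W_expr = rng.standard_normal((2 * H, 68 * 2)) * 0.1  # 68 2-D facial landmarks
W_pose = rng.standard_normal((2 * H, 3)) * 0.1       # head pose: pitch, yaw, roll
landmarks = shared @ W_expr          # (T, 136) per-frame landmark predictions
pose = shared @ W_pose               # (T, 3) per-frame head-pose predictions
```

Forking after the shared trunk lets both modalities be trained jointly from one speech representation, which is the joint-learning property the citation statement highlights.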
“…Last but not least, handling lip syncing and facial animation are not sufficient for the perception of realism of talking-heads. The entire facial expression considering the correlation between all facial elements and head pose also play an important role [Faigin 2012;Greenwood et al 2018]. These correlations, however, are less constrained by the audio and thus hard to be estimated.…”
Section: Introduction
confidence: 99%
“…These approaches achieve highly realistic results, but they are typically personalized and are not audio-driven. Most fully speech-driven 3D face animation techniques require either personalized models [5,22,26] or map to lower fidelity blendshape models [24] or facial landmarks [12,16]. Cao et al [5] propose speech-driven animation of a realistic textured personalized 3D face model that requires mocap data from the person to be animated, offline processing and blending of motion snippets.…”
Section: Related Work
confidence: 99%
“…Head animation synthesis focuses on synthesizing head pose from input speech. Some works direct regress head pose with BiLSTM (Ding, Zhu, and Xie 2015;Greenwood, Matthews, and Laycock 2018) or the encoder of transformer (Vaswani et al 2017). More precisely, head pose generation from speech is a one-to-many mapping, Sadoughi and Busso (2018) employ GAN (Goodfellow et al 2014;Mirza and Osindero 2014;Yu et al 2019b,a) to retain the diversity.…”
Section: Facial Animation Synthesis
confidence: 99%