Interspeech 2017
DOI: 10.21437/interspeech.2017-894

Predicting Head Pose from Speech with a Conditional Variational Autoencoder

Abstract: Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution visual cues make to the degree to which we, as human observers, find an animation acceptable. Rigid head motion is one visual mode that universally co-occurs with speech, so it is a reasonable strategy to seek a transformation from the speech mode to predict the head pose. Several previous authors have shown that prediction is possible, but experiments are typically confined to rigidly produced…
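As a rough illustration of the idea named in the title, the following is a minimal PyTorch sketch of a conditional variational autoencoder that reconstructs a head-pose vector conditioned on acoustic features. All names and dimensions (SpeechToPoseCVAE, speech_dim, pose_dim, latent_dim) are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a conditional VAE mapping acoustic features to head pose.
# Dimensions and layer sizes are assumptions, not the authors' configuration.
import torch
import torch.nn as nn

class SpeechToPoseCVAE(nn.Module):
    def __init__(self, speech_dim=26, pose_dim=6, latent_dim=16, hidden=128):
        super().__init__()
        # Encoder: infers q(z | pose, speech) from a pose/speech pair.
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim + speech_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        # Decoder: reconstructs pose from z, conditioned on the speech features.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + speech_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim))

    def forward(self, pose, speech):
        h = self.encoder(torch.cat([pose, speech], dim=-1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterisation trick: sample z differentiably.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.decoder(torch.cat([z, speech], dim=-1))
        return recon, mu, logvar
```

At synthesis time, one would sample z from the prior and decode conditioned on speech alone, e.g. `model.decoder(torch.cat([torch.randn(batch, 16), speech], dim=-1))`; sampling different z values yields different plausible head motions for the same speech, which is the point of using a CVAE for this many-to-many mapping.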

Cited by 39 publications (29 citation statements). References 22 publications.
“…Several recent works have used neural networks to generate body-motion aspects such as locomotion [HHS*17, HKS17, HAB19], lip movements [SSKS17] and head motion [GLM17, SB18]. A challenge in these domains is the large variation in the output given the same control.…”
Section: Data-driven Human Body-motion Generation
confidence: 99%
“…Yoon et al. [45] learned a mapping from text to gestures using a recurrent neural network. Speech-driven head-motion and facial-gesture generation has been performed using methods such as Variational Autoencoders (VAEs) [31] to predict head pose conditioned on acoustic features [19], Bidirectional Long Short-Term Memory (BLSTM) networks [20, 21, 41], and conditional Generative Adversarial Networks (GANs) [18], as seen in [11, 42]. In another line of work, Karras et al. [28] trained a CNN-based neural network using speech together with a learned emotion representation as input to generate corresponding 3D meshes of faces with impressively little training data.…”
Section: Gesture Generation
confidence: 99%
“…More recently, Haag [25] uses BLSTMs and bottleneck features [26]. In our own earlier work [4], we use a BLSTM-based Conditional Variational Autoencoder (CVAE) to model the many-to-many mapping from speech to head pose, both for the speaker and for the head pose of the listener in dyadic conversation [27].…”
Section: Head Pose
confidence: 99%
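To make the quoted architecture more concrete, here is a minimal PyTorch sketch of a BLSTM serving as the recognition (encoder) network of a sequence-level CVAE, producing per-frame Gaussian parameters over a latent motion code. The class name and all dimensions are assumptions for illustration, not the model described in [4].

```python
# Sketch of a BLSTM recognition network for a sequence-level CVAE.
# Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    def __init__(self, in_dim=32, hidden=64, latent_dim=16):
        super().__init__()
        # Bidirectional LSTM reads the whole speech/pose sequence,
        # so each frame's latent code sees past and future context.
        self.blstm = nn.LSTM(in_dim, hidden, batch_first=True,
                             bidirectional=True)
        # Per-frame Gaussian parameters over the latent motion code.
        self.to_mu = nn.Linear(2 * hidden, latent_dim)
        self.to_logvar = nn.Linear(2 * hidden, latent_dim)

    def forward(self, seq):            # seq: (batch, frames, in_dim)
        h, _ = self.blstm(seq)         # h: (batch, frames, 2 * hidden)
        return self.to_mu(h), self.to_logvar(h)
```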
“…Carrying on from our previous work [4], we use Deep BLSTMs to predict the facial deformation and the six DoF of rigid head pose combined. Clearly, much of the activity of the orofacial region has significant correspondence with speech production.…”
Section: Model Description
confidence: 99%
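The combined output the statement describes (facial deformation plus the six rigid-pose DoF) could be sketched as a deep BLSTM regressor with a single joint output layer. This is a hedged illustration under assumed dimensions; the layer count, feature sizes, and the DeepBLSTMRegressor name are not the authors' model.

```python
# Sketch of a deep BLSTM regressing facial deformation and 6-DoF head pose
# jointly from speech features. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class DeepBLSTMRegressor(nn.Module):
    def __init__(self, speech_dim=26, deform_dim=30, layers=3, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(speech_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        # Joint head: deformation parameters plus 3 rotations + 3 translations.
        self.head = nn.Linear(2 * hidden, deform_dim + 6)

    def forward(self, speech):         # speech: (batch, frames, speech_dim)
        h, _ = self.blstm(speech)
        out = self.head(h)
        return out[..., :-6], out[..., -6:]  # deformation, rigid pose
```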