2016
DOI: 10.1007/978-3-319-47665-0_18

Bidirectional LSTM Networks Employing Stacked Bottleneck Features for Expressive Speech-Driven Head Motion Synthesis

Abstract: Previous work in speech-driven head motion synthesis is centred around Hidden Markov Model (HMM) based methods and data that does not show a large variability of expressiveness in both speech and motion. When using expressive data, these systems often fail to produce satisfactory results. Recent studies have shown that using deep neural networks (DNNs) results in a better synthesis of head motion, in particular when employing bidirectional long short-term memory (BLSTM). We present a novel approach w…
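As a rough illustration of the model family the abstract describes, here is a minimal PyTorch sketch of a bidirectional LSTM mapping per-frame acoustic (bottleneck) features to head-motion parameters. All layer sizes, dimensions, and names are illustrative assumptions, not the paper's actual configuration.

# Minimal sketch: a BLSTM that maps per-frame acoustic features to
# head-motion parameters. Dimensions and layer counts are assumptions,
# not the configuration reported in the paper.
import torch
import torch.nn as nn

class BLSTMHeadMotion(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=128, motion_dim=3):
        super().__init__()
        # A bidirectional LSTM sees both past and future acoustic context.
        self.blstm = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Linear readout to head-motion parameters (e.g. 3 rotation angles).
        self.out = nn.Linear(2 * hidden_dim, motion_dim)

    def forward(self, x):  # x: (batch, frames, feat_dim)
        h, _ = self.blstm(x)
        return self.out(h)  # (batch, frames, motion_dim)

# Example: a 2-second utterance at 100 fps with 64-dim acoustic features.
model = BLSTMHeadMotion()
speech = torch.randn(1, 200, 64)
motion = model(speech)  # predicted head rotations per frame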

Cited by 26 publications (33 citation statements: 1 supporting, 32 mentioning, 0 contrasting). References 21 publications.
“…The MLP models are, however, limited in modelling temporal data. Ding et al. [15] and Haag and Shimodaira [14] thus compared the MLP with a bidirectional long short-term memory (BLSTM) model on the head motion synthesis task. Both works reported improvements of the BLSTM-based system over the MLP-based one in terms of the naturalness of the synthesised motion, assessed via a user study, root mean-squared error, and canonical correlations between the original and synthesised head motion.…”
Section: Related Work (mentioning)
confidence: 99%
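For context, the two objective measures this statement mentions can be computed as follows. This is a minimal sketch assuming frame-by-frame head-rotation trajectories of shape (frames, 3); the data here are pure placeholders, and the cited evaluations may differ in detail.

# Sketch of the two objective metrics named in the statement: RMSE and
# canonical correlations between original and synthesised head motion.
# Shapes and data are placeholder assumptions.
import numpy as np
from sklearn.cross_decomposition import CCA

def rmse(original, synthesised):
    # Root-mean-squared error over all frames and channels.
    return np.sqrt(np.mean((original - synthesised) ** 2))

def canonical_correlations(original, synthesised, n_components=3):
    # Project both trajectories onto maximally correlated directions,
    # then report the correlation of each canonical-variate pair.
    cca = CCA(n_components=n_components)
    u, v = cca.fit_transform(original, synthesised)
    return [np.corrcoef(u[:, i], v[:, i])[0, 1] for i in range(n_components)]

orig = np.random.randn(500, 3)                # placeholder ground truth
synth = orig + 0.3 * np.random.randn(500, 3)  # placeholder synthesis
print(rmse(orig, synth))
print(canonical_correlations(orig, synth))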
“…In addition, humans tend to feel unsettled when humanoid robots do not act realistically [11], and when there is an incompatibility between speech and associated gestures [12]. Several works have developed audio-driven motion synthesis systems targeted at animated conversational agents and talking avatars, for example for the synthesis of head motion [13][14][15] or hand movements [16,17]. To the best of our knowledge, there has been no previous attempt at whole upper-body (head, hands and torso) motion synthesis from audio on a humanoid robot.…”
Section: Introduction (mentioning)
confidence: 99%
“…Several recent works have applied neural networks in this domain [10,11,29,30,34]. Among the cited works, Haag & Shimodaira [11] use a bottleneck network to learn compact representations, although their bottleneck features are subsequently used to define prediction inputs rather than prediction outputs, as in the work we presented. Our proposed method works on a different aspect of non-verbal behavior that co-occurs with speech, namely generating body motion driven by speech.…”
Section: Data-driven Head and Face Movements (mentioning)
confidence: 99%
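To make the bottleneck idea concrete, below is a minimal sketch of a feed-forward network with a narrow hidden layer whose activations are tapped as compact features that can then serve as inputs to a downstream predictor, in the spirit of the usage described above. The layer sizes and the return_bottleneck flag are hypothetical, chosen for illustration only.

# Sketch of a bottleneck network: a narrow hidden layer yields compact
# features that can be reused as inputs elsewhere. Sizes are assumptions.
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    def __init__(self, in_dim=120, bottleneck_dim=32, out_dim=40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim),  # the narrow bottleneck layer
        )
        self.head = nn.Sequential(
            nn.ReLU(), nn.Linear(bottleneck_dim, out_dim),
        )

    def forward(self, x, return_bottleneck=False):
        z = self.encoder(x)  # compact representation
        return z if return_bottleneck else self.head(z)

net = BottleneckNet()
frames = torch.randn(200, 120)  # e.g. stacked acoustic frames
features = net(frames, return_bottleneck=True)  # 32-dim bottleneck features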
“…Deep Bi-Directional Long Short-Term Memory (BLSTM) models appear in Ding et al. [24], where they report improvements over their own earlier work. More recently, Haag [25] uses BLSTMs and bottleneck features [26]. In our own earlier work [4], we use a BLSTM-based Conditional Variational Autoencoder (CVAE) to model the many-to-many mapping of speech to head pose prediction, both for the speaker and for the head pose of the listener in dyadic conversation [27].…”
Section: Head Pose (mentioning)
confidence: 99%
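As a rough sketch of the conditional-VAE idea this statement describes, the following illustrates a BLSTM encoder that infers a latent variable from paired speech and motion, and a BLSTM decoder that generates motion conditioned on both the speech and a sampled latent. Every name and dimension here is an assumption for illustration; it is not the cited system's architecture.

# Sketch of a BLSTM-based conditional VAE for speech-to-motion mapping.
# A latent z captures the one-to-many nature of the mapping; the decoder
# is conditioned on speech plus z. All dimensions are assumptions.
import torch
import torch.nn as nn

class BLSTMCVAE(nn.Module):
    def __init__(self, speech_dim=64, motion_dim=3, hidden=128, z_dim=16):
        super().__init__()
        # Encoder sees speech and motion, infers the latent distribution.
        self.enc = nn.LSTM(speech_dim + motion_dim, hidden,
                           batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(2 * hidden, z_dim)
        self.to_logvar = nn.Linear(2 * hidden, z_dim)
        # Decoder generates motion from speech plus the sampled latent.
        self.dec = nn.LSTM(speech_dim + z_dim, hidden,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, motion_dim)

    def forward(self, speech, motion):
        h, _ = self.enc(torch.cat([speech, motion], dim=-1))
        summary = h.mean(dim=1)  # pool encoder states over time
        mu, logvar = self.to_mu(summary), self.to_logvar(summary)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterise
        z_seq = z.unsqueeze(1).expand(-1, speech.size(1), -1)
        d, _ = self.dec(torch.cat([speech, z_seq], dim=-1))
        return self.out(d), mu, logvar

model = BLSTMCVAE()
speech = torch.randn(2, 200, 64)
motion = torch.randn(2, 200, 3)
recon, mu, logvar = model(speech, motion)  # reconstruction + latent stats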