Head motion synthesis from speech using deep neural networks (2014)
DOI: 10.1007/s11042-014-2156-2

Cited by 47 publications (53 citation statements)
References 35 publications
“…It is similar to the work of [10] except that we did not use RBMs in pre-training. Acoustic and EMA features were concatenated from a context of five frames to the left and five frames to the right of the current frame, resulting in a 572-dimensional input vector.…”
Section: Experimental Setups
confidence: 97%
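The feature-stacking step described in this statement is easy to illustrate. Below is a minimal sketch, assuming 52-dimensional per-frame feature vectors (so that the 11 stacked frames give the 572 dimensions quoted above) and repeating the first/last frame at the sequence edges; the frame dimensionality and padding strategy are assumptions, not details from the cited work.

```python
import numpy as np

def stack_context(features, left=5, right=5):
    """Concatenate each frame with `left` preceding and `right` following
    frames, repeating the first/last frame at the sequence edges."""
    padded = np.concatenate(
        [np.repeat(features[:1], left, axis=0),
         features,
         np.repeat(features[-1:], right, axis=0)],
        axis=0,
    )
    windows = [padded[i:i + len(features)] for i in range(left + 1 + right)]
    return np.concatenate(windows, axis=1)

# Assumed 52-dim acoustic+EMA frames; +/-5 frames of context -> 52 * 11 = 572.
frames = np.random.randn(200, 52)
print(stack_context(frames).shape)  # (200, 572)
```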
“…Ding et al [10] were the first to use DNNs for speech-driven head motion synthesis. They pre-trained a deep belief network (DBN) with stacked restricted Boltzmann machines, then added a target layer on top of the DBN for parameter fine-tuning.…”
Section: Introduction
confidence: 99%
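A minimal sketch of that pre-train-then-fine-tune recipe is given below, using scikit-learn for the RBMs and PyTorch for the fine-tuning stage. The layer sizes, learning rates, and use of Bernoulli RBMs on min-max-scaled features (rather than Gaussian-Bernoulli RBMs on raw real-valued features) are illustrative assumptions, not the settings of Ding et al.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 572-dim stacked acoustic inputs, 3-dim head-rotation targets.
X = np.random.rand(1000, 572)
Y = np.random.rand(1000, 3).astype(np.float32)
X = MinMaxScaler().fit_transform(X).astype(np.float32)

# Greedy layer-wise RBM pre-training (the deep belief network).
hidden_sizes = [256, 256]
rbms, layer_input = [], X
for size in hidden_sizes:
    rbm = BernoulliRBM(n_components=size, learning_rate=0.05, n_iter=10)
    layer_input = rbm.fit_transform(layer_input)
    rbms.append(rbm)

# Copy the pre-trained weights into a feed-forward network and add a
# linear target layer on top for the regression task.
layers, in_dim = [], X.shape[1]
for rbm, size in zip(rbms, hidden_sizes):
    linear = nn.Linear(in_dim, size)
    with torch.no_grad():
        linear.weight.copy_(torch.tensor(rbm.components_, dtype=torch.float32))
        linear.bias.copy_(torch.tensor(rbm.intercept_hidden_, dtype=torch.float32))
    layers += [linear, nn.Sigmoid()]
    in_dim = size
layers.append(nn.Linear(in_dim, Y.shape[1]))  # added target layer
model = nn.Sequential(*layers)

# Fine-tune the whole stack with backpropagation on a regression loss.
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
inputs, targets = torch.from_numpy(X), torch.from_numpy(Y)
for _ in range(20):
    optimiser.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimiser.step()
```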
“…DNNs were proposed as a modelling strategy for head motion prediction by Ding et al [13]. Using a deep Feed-Forward Neural Network (FFN) regression model to predict Euler angles of nod, yaw and roll, they were able to report advantages over the previous HMM based approaches and were able to avoid the problem of clustering motion.…”
Section: Introduction
confidence: 99%
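For contrast with the pre-trained stack sketched earlier, the purely feed-forward regression model described here, mapping a stacked acoustic input to the three Euler angles, only needs a few lines; the layer widths and activation are placeholders rather than the configuration of [13].

```python
import torch.nn as nn

# Hypothetical sizes: 572-dim stacked acoustic input -> (nod, yaw, roll).
ffn = nn.Sequential(
    nn.Linear(572, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, 3),   # Euler angles for the current frame
)
# Trained per frame with an MSE loss, as in the fine-tuning loop above.
```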
“…Another example by Sutskever et al [17] reports state of the art performance for the language translation task. Ding et al [18] introduced Bi-Directional Long Short Term Memory (BLSTM) networks to the head motion task, noting improvements over their own earlier work [13]. More recently Haag [19] uses BLSTMs and Bottleneck features [20] and noted a subtle improvement.…”
Section: Introduction
confidence: 99%
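A BLSTM regressor of the kind referred to in this statement can be sketched as below; the input dimensionality, hidden size, and number of layers are illustrative placeholders rather than the settings of [18] or [19].

```python
import torch
import torch.nn as nn

class BLSTMHeadMotion(nn.Module):
    """Bidirectional LSTM mapping per-frame acoustic features to
    per-frame head rotation (nod, yaw, roll)."""

    def __init__(self, in_dim=26, hidden=128, out_dim=3):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, out_dim)  # forward + backward states

    def forward(self, x):        # x: (batch, frames, in_dim)
        h, _ = self.blstm(x)     # h: (batch, frames, 2 * hidden)
        return self.out(h)       # (batch, frames, out_dim)

model = BLSTMHeadMotion()
speech = torch.randn(4, 300, 26)  # 4 utterances, 300 frames each
angles = model(speech)            # (4, 300, 3)
```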
“…Related work has also converted acoustic speech features (e.g. filter bank, MFCC, LPC) into head motion parameters (nod, yaw, roll) using a feed-forward neural network model [18]. This paper continues with the DNN-based approach for predicting visual features from a text input but aims to improve the resulting naturalness of the animation.…”
Section: Introduction
confidence: 99%
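The acoustic front-ends named here (filter bank, MFCC) can be computed with a standard library such as librosa; the frame settings and file name below are illustrative assumptions, not the configuration used in [18].

```python
import numpy as np
import librosa

# Hypothetical input file and frame settings (10 ms hop, 25 ms window at 16 kHz).
wav, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13,
                            hop_length=160, n_fft=400)         # (13, num_frames)

# Log mel filter-bank energies as an alternative front-end.
fbank = librosa.power_to_db(
    librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=26,
                                   hop_length=160, n_fft=400))  # (26, num_frames)

# Frame-wise feature matrix to feed the neural network.
features = np.concatenate([mfcc, fbank]).T                      # (num_frames, 39)
```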