International Conference on Multimodal Interfaces and the Workshop on Machine Learning for Multimodal Interaction 2010
DOI: 10.1145/1891903.1891942
Visual speech synthesis by modelling coarticulation dynamics using a non-parametric switching state-space model

Abstract: We present a novel approach to speech-driven facial animation using a non-parametric switching state-space model based on Gaussian processes. The model is an extension of the shared Gaussian process dynamical model, augmented with switching states. Audio and visual data from a talking-head corpus are jointly modelled using the proposed method. The switching states are found using variable length Markov models trained on labelled phonetic data. We also propose a synthesis technique that takes into account both …
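The abstract describes a switching state-space model: a discrete switch state selects which dynamics govern the latent trajectory at each step. As a toy illustration only, the sketch below uses switching *linear* dynamics; the paper itself uses Gaussian-process dynamics (a switching shared GPDM) with switch states derived from variable length Markov models, and every name and parameter here is an illustrative assumption, not the paper's implementation.

```python
import random

# Toy switching state-space model (illustrative assumption, not the paper's
# GPDM): discrete state s_t picks the linear dynamics applied to latent x_t.
DYNAMICS = {0: (0.9, 0.0), 1: (0.5, 1.0)}    # per-state (decay, drift)
TRANSITION = {0: [0.8, 0.2], 1: [0.3, 0.7]}  # switch-state transition probs

def sample_trajectory(steps, seed=0):
    """Sample (switch state, latent value) pairs from the toy model."""
    rng = random.Random(seed)
    s, x, traj = 0, 0.0, []
    for _ in range(steps):
        a, b = DYNAMICS[s]
        x = a * x + b + rng.gauss(0.0, 0.1)  # linear dynamics + noise
        traj.append((s, x))
        # Sample the next switch state from the current row of TRANSITION.
        s = 0 if rng.random() < TRANSITION[s][0] else 1
    return traj

print(sample_trajectory(5))
```

In the paper's formulation the linear map above is replaced by a Gaussian-process mapping learned jointly over audio and visual data; the toy version only shows how a discrete switch state modulates continuous dynamics.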

Cited by 16 publications (13 citation statements) | References 25 publications
“…A more flexible approach is to use a generative statistical model, such as GMMs [Luo et al 2014], switching linear dynamical systems [Englebienne et al 2007], switching shared Gaussian process dynamical models [Deena et al 2010], recurrent neural networks [Fan et al 2015], or hidden Markov models (HMMs) and their variants [Anderson et al 2013; Brand 1999; Fu et al 2005; Govokhina et al 2006; Schabus et al 2011; Wang et al 2012; Xie and Liu 2007]. During training of a HMM-based synthesiser, context-dependent decision trees cluster motion data and combine states with similar distributions to account for sparsity of the phonetic contexts in the training set.…”
Section: Related Work
confidence: 99%
“…Meanwhile, naturalness was assessed on a scale of 1 to 5, where 1 means 'the mouth motion is very unrepresentative and very rough', 2 means 'unrepresentative mouth motion with rough movement changes', 3 means 'fairly representative mouth motion with moderately smooth movement', 4 means 'representative mouth motion with subtle movement', and 5 means 'very representative mouth motion with very fine movement changes'. The average respondent score was calculated as the Mean Opinion Score (MOS), formulated in equation (3).…”
Section: Results
confidence: 99%
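The excerpt above refers to "equation (3)" for the MOS without reproducing it; the standard Mean Opinion Score is simply the arithmetic mean of the N respondent ratings. A minimal sketch under that assumption (the ratings list is hypothetical):

```python
def mean_opinion_score(scores):
    """MOS = (1/N) * sum of individual ratings on the 1-5 scale."""
    if not scores:
        raise ValueError("need at least one rating")
    return sum(scores) / len(scores)

# Hypothetical naturalness ratings from five respondents.
ratings = [4, 5, 3, 4, 4]
print(mean_opinion_score(ratings))  # → 4.0
```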
“…Recently, natural and realistic facial animation is one of the most challenging research areas [3]. The effort to produce natural and realistic face animation is to add transition animations between pronunciations or apply dynamic viseme to the visual speech synthesis [4].…”
Section: Introduction
confidence: 99%
“…The closest approximation familiar to us is speech-driven facial animation [12]. We here discuss specific issues in using GPDMs as speech acoustic models, and propose an initialization scheme for sequential signals such as speech utterances.…”
Section: Implementing GPDMs for Speech
confidence: 99%