2012
DOI: 10.1109/tasl.2012.2202651

Relating Objective and Subjective Performance Measures for AAM-Based Visual Speech Synthesis

Abstract: We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that uses acoustic features as input, and one that uses a phonetic transcription as input. Both synthesizers are trained on the same data, and their performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceive…
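
As context for the abstract, the sketch below shows the standard linear AAM parameterisation that such a synthesizer drives frame by frame. It is a generic illustration under assumed names and array shapes, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of AAM-based frame reconstruction (illustrative only).
# mean_shape: (2V,) mean mesh; shape_modes: (2V, Ns) PCA shape basis;
# mean_app: (P,) mean texture; app_modes: (P, Na) PCA appearance basis.
def aam_reconstruct(mean_shape, shape_modes, shape_params,
                    mean_app, app_modes, app_params):
    """Reconstruct one frame's mesh and texture from AAM parameters."""
    shape = mean_shape + shape_modes @ shape_params   # linear shape model
    appearance = mean_app + app_modes @ app_params    # linear texture model
    return shape, appearance

# Either synthesizer compared in the paper (acoustic-feature driven or
# phoneme driven) can be viewed as producing a trajectory of
# (shape_params, app_params) over time, rendered one frame per step.
```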

Cited by 11 publications (2 citation statements) · References 62 publications

“…comparing feature trajectories between synthesised and ground truth data, or calculating RMS error between synthesised and ground truth meshes etc. However, Theobald and Matthews [2012] show that objective measures are not necessarily a good indicator of the subjective perception of naturalness in a synthesis technique. RMS error seems a particularly poor choice since it averages across an entire sequence, whereas the authors show that an artefact in a single frame of a sequence can lead to the entire sequence being perceived as bad.…”
Section: Results (mentioning)
confidence: 97%
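
The point this excerpt makes about sequence-averaged RMS masking single-frame artefacts is easy to demonstrate numerically. The toy sketch below (illustrative only, not the paper's evaluation protocol) corrupts one frame of an otherwise near-perfect synthetic sequence: the worst-frame error is large, yet the sequence average barely moves.

```python
import numpy as np

# Per-frame RMS error between synthesised and ground-truth meshes,
# versus the sequence-averaged RMS. A large artefact confined to a
# single frame barely shifts the average, even though viewers may
# judge the whole sequence as bad.
rng = np.random.default_rng(0)
n_frames, n_vertices = 100, 50
gt = rng.normal(size=(n_frames, n_vertices, 2))        # ground-truth meshes
synth = gt + rng.normal(scale=0.01, size=gt.shape)     # near-perfect synthesis
synth[40] += 1.0                                       # one badly broken frame

per_frame_rms = np.sqrt(((synth - gt) ** 2).mean(axis=(1, 2)))
print(f"sequence-averaged RMS: {per_frame_rms.mean():.3f}")  # ~0.02
print(f"worst single frame:    {per_frame_rms.max():.3f}")   # ~1.0
```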
“…Sample-based approaches concatenate visual speech units contained in a database, where the units might be fixed-length (e.g. phonemes, visemes, or words [4,5,6,7]) or of variable length [8,9,10]. A cost function, based on phonetic context and smoothness of concatenation, is then minimised to find the set of units which form the animation.…”
Section: Introduction (mentioning)
confidence: 99%
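
The cost minimisation mentioned in this excerpt is commonly solved with dynamic programming over the candidate units. The sketch below is a generic Viterbi-style unit selection under assumed cost matrices; the function name and cost definitions are illustrative and not taken from any of the cited systems.

```python
import numpy as np

def select_units(target_costs, join_costs):
    """Pick a minimum-cost unit sequence by dynamic programming.

    target_costs: (T, K) cost of candidate unit k at position t
                  (e.g. phonetic-context mismatch).
    join_costs:   (K, K) cost of concatenating unit j after unit i
                  (e.g. smoothness at the join).
    """
    T, K = target_costs.shape
    total = target_costs[0].copy()          # best cost ending in each unit
    back = np.zeros((T, K), dtype=int)      # best-predecessor table
    for t in range(1, T):
        trans = total[:, None] + join_costs       # (K, K): i -> j costs
        back[t] = trans.argmin(axis=0)             # best predecessor of j
        total = trans.min(axis=0) + target_costs[t]
    path = [int(total.argmin())]
    for t in range(T - 1, 0, -1):           # backtrace from the last step
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

In a real concatenative synthesiser the target cost would encode the phonetic-context match and the join cost the smoothness of concatenation at unit boundaries, as the excerpt describes.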