2012
DOI: 10.1109/tasl.2012.2202651

Relating Objective and Subjective Performance Measures for AAM-Based Visual Speech Synthesis

Abstract: We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that uses acoustic features as input, and one that uses a phonetic transcription as input. Both synthesizers are trained on the same data, and their performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceive…
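
As context for the abstract, the sketch below shows the standard linear AAM parameterisation that such a synthesizer drives frame by frame. It is a generic illustration under assumed names and array shapes, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of AAM-based frame reconstruction (illustrative only).
# mean_shape: (2V,) mean mesh; shape_modes: (2V, Ns) PCA shape basis;
# mean_app: (P,) mean texture; app_modes: (P, Na) PCA appearance basis.
def aam_reconstruct(mean_shape, shape_modes, shape_params,
                    mean_app, app_modes, app_params):
    """Reconstruct one frame's mesh and texture from AAM parameters."""
    shape = mean_shape + shape_modes @ shape_params   # linear shape model
    appearance = mean_app + app_modes @ app_params    # linear texture model
    return shape, appearance

# Either synthesizer compared in the paper (acoustic-feature driven or
# phoneme driven) can be viewed as producing a trajectory of
# (shape_params, app_params) over time, rendered one frame per step.
```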

Cited by 11 publications (2 citation statements) · References 62 publications

“…comparing feature trajectories between synthesised and ground truth data, or calculating RMS error between synthesised and ground truth meshes etc. However, Theobald and Matthews [2012] show that objective measures are not necessarily a good indicator of the subjective perception of naturalness in a synthesis technique. RMS error seems a particularly poor choice since it averages across an entire sequence, whereas the authors show that an artefact in a single frame of a sequence can lead to the entire sequence being perceived as bad.…”
Section: Results (mentioning)
confidence: 97%
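
The point this excerpt makes about sequence-averaged RMS masking single-frame artefacts is easy to demonstrate numerically. The toy sketch below (illustrative only, not the paper's evaluation protocol) corrupts one frame of an otherwise near-perfect synthetic sequence: the worst-frame error is large, yet the sequence average barely moves.

```python
import numpy as np

# Per-frame RMS error between synthesised and ground-truth meshes,
# versus the sequence-averaged RMS. A large artefact confined to a
# single frame barely shifts the average, even though viewers may
# judge the whole sequence as bad.
rng = np.random.default_rng(0)
n_frames, n_vertices = 100, 50
gt = rng.normal(size=(n_frames, n_vertices, 2))        # ground-truth meshes
synth = gt + rng.normal(scale=0.01, size=gt.shape)     # near-perfect synthesis
synth[40] += 1.0                                       # one badly broken frame

per_frame_rms = np.sqrt(((synth - gt) ** 2).mean(axis=(1, 2)))
print(f"sequence-averaged RMS: {per_frame_rms.mean():.3f}")  # ~0.02
print(f"worst single frame:    {per_frame_rms.max():.3f}")   # ~1.0
```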
“…Sample-based approaches concatenate visual speech units contained in a database, where the units might be fixed-length (e.g. phonemes, visemes, or words [4,5,6,7]) or of variable length [8,9,10]. A cost function, based on phonetic context and smoothness of concatenation, is then minimised to find the set of units which form the animation.…”
Section: Introduction (mentioning)
confidence: 99%
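
The cost minimisation mentioned in this excerpt is commonly solved with dynamic programming over the candidate units. The sketch below is a generic Viterbi-style unit selection under assumed cost matrices; the function name and cost definitions are illustrative and not taken from any of the cited systems.

```python
import numpy as np

def select_units(target_costs, join_costs):
    """Pick a minimum-cost unit sequence by dynamic programming.

    target_costs: (T, K) cost of candidate unit k at position t
                  (e.g. phonetic-context mismatch).
    join_costs:   (K, K) cost of concatenating unit j after unit i
                  (e.g. smoothness at the join).
    """
    T, K = target_costs.shape
    total = target_costs[0].copy()          # best cost ending in each unit
    back = np.zeros((T, K), dtype=int)      # best-predecessor table
    for t in range(1, T):
        trans = total[:, None] + join_costs       # (K, K): i -> j costs
        back[t] = trans.argmin(axis=0)             # best predecessor of j
        total = trans.min(axis=0) + target_costs[t]
    path = [int(total.argmin())]
    for t in range(T - 1, 0, -1):           # backtrace from the last step
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

In a real concatenative synthesiser the target cost would encode the phonetic-context match and the join cost the smoothness of concatenation at unit boundaries, as the excerpt describes.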