2006
DOI: 10.1016/j.image.2005.04.002
Partial linear regression for speech-driven talking head application

Cited by 5 publications (3 citation statements)
References 25 publications (21 reference statements)
“…Acoustic features used in feature-driven synthesis of visual speech have included Mel frequency cepstral coefficients (MFCCs) [14]-[16], [18]-[20]; filter-bank outputs [6]; line spectral pairs/frequencies (LSPs/LSFs) [17], [40]; formant frequencies [21]; linear prediction coefficients (LPCs) [12], [13] or perceptual LPCs (RASTA-PLP) [11]; and several forms of mapping function have been proposed, including vector quantisation or a nearest-neighbour look-up [6]; regression [17], [19]; artificial neural networks [12], [13], [15], [18], [41]; hidden Markov models (HMMs) [11], [42]; and switching linear dynamical systems [14].…”
Section: Related Work
confidence: 99%
“…One approach for objectively measuring the performance of a synthesizer is to re-synthesize a set of test sentences for which the original visual speech is available and measure the distance between key points located about the face [6], [17], [44], [46], [52], [53], within the parameters used to model the visual speech [20], [23], [43], [54]- [56], or in the image pixels [12]. Although this approach is intuitive and simple to compute, there are two main limitations.…”
Section: A. Evaluating Visual Speech Synthesizers
confidence: 99%
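The key-point distance measure described in the excerpt above can be sketched as follows. This is a minimal illustrative example, not the metric of any cited paper; the function name, array shapes, and toy data are assumptions.

```python
import numpy as np

def keypoint_rmse(true_pts, synth_pts):
    """RMS Euclidean distance between matched facial key points.

    true_pts, synth_pts: arrays of shape (frames, points, 2), holding
    (x, y) positions of corresponding key points per video frame.
    """
    # Per-frame, per-point Euclidean distance -> shape (frames, points).
    d = np.linalg.norm(true_pts - synth_pts, axis=-1)
    # Aggregate into a single RMS error over all frames and points.
    return float(np.sqrt(np.mean(d ** 2)))

# Toy data: 10 frames, 20 key points, with a uniform 0.5 offset in x and y.
rng = np.random.default_rng(1)
gt = rng.normal(size=(10, 20, 2))
err = keypoint_rmse(gt, gt + 0.5)  # every point off by sqrt(0.5) ≈ 0.707
```

Real evaluations of this kind compare key points tracked in original footage against those of the re-synthesized face; as the excerpt notes, the same distance can instead be computed in model-parameter space or directly on image pixels.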
“…Many systems use text and the corresponding phoneme string as input and then use concatenation [1], dominance functions [2] or trajectory generation [3] to produce the desired animation. Other approaches use parameterised speech directly as input and then use formant analysis [4], linear regression [5], or probabilistic modelling [6], [7] to generate the appropriate motion.…”
Section: Introduction
confidence: 99%
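The linear-regression mapping mentioned in the excerpt above can be sketched as an ordinary least-squares fit from acoustic feature vectors to visual parameters. This is a hedged illustration only — the paper's *partial* linear regression is a different, more elaborate method; the dimensions (13 MFCCs, 6 visual parameters) and synthetic data here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: 200 frames of 13 acoustic features (e.g. MFCCs)
# paired with 6 visual speech parameters per frame.
X = rng.normal(size=(200, 13))                      # acoustic features
W_true = rng.normal(size=(13, 6))                   # ground-truth mapping
Y = X @ W_true + 0.01 * rng.normal(size=(200, 6))   # noisy visual parameters

# Fit the regression weights by ordinary least squares: W = argmin ||XW - Y||.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# At synthesis time, predict visual parameters for a new acoustic frame.
x_new = rng.normal(size=(1, 13))
y_pred = x_new @ W  # shape (1, 6)
```

Feature-driven synthesizers of this family apply such a learned mapping frame by frame, typically followed by temporal smoothing of the predicted parameter trajectories.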