The Effect of Real-Time Constraints on Automatic Speech Animation

Websdale, Danny; Taylor, Sarah L.; Milner, Ben

doi:10.21437/interspeech.2018-2066

Cited by 4 publications

(7 citation statements)

References 17 publications

“…The system that generates visual features directly from an audio waveform follows the deep learning approach in [22], and maps a sequence of input MFCC vectors to an output sequence of AAM vectors. The approach in [22] was developed for a single speaker and uses an audio-visual speech database (KB-2k) for training that provides both acoustic and visual features. Our approach extends this system to become speaker independent.…”

Section: B Multi-speaker Audio-to-visual Speech Mapping (mentioning)

confidence: 99%

“…In terms of context window sizes, a sequence of K_A = 33 MFCC features (340 ms of audio) and K_VB = 3 visual features (100 ms of video) gave best performance. Further details of the audio-to-visual speech model architecture can be found in [22].…”

Section: B Multi-speaker Audio-to-visual Speech Mapping (mentioning)

confidence: 99%
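
The window sizes quoted above can be checked with a little arithmetic. The sketch below is illustrative only: the 10 ms frame hop, 20 ms analysis window, and 30 fps video rate are assumptions chosen because they reproduce the quoted 340 ms and 100 ms figures; the citing paper does not state them in this excerpt, and the function names are hypothetical.

```python
def audio_span_ms(num_frames, hop_ms=10.0, win_ms=20.0):
    """Audio duration covered by a run of overlapping analysis frames.

    hop_ms and win_ms are assumed values, not taken from the paper.
    """
    return (num_frames - 1) * hop_ms + win_ms


def video_span_ms(num_frames, fps=30.0):
    """Video duration covered by num_frames frames (fps is an assumption)."""
    return num_frames * 1000.0 / fps


def sliding_windows(frames, size):
    """All contiguous windows of `size` frames, advancing one frame at a time."""
    return [frames[i:i + size] for i in range(len(frames) - size + 1)]


if __name__ == "__main__":
    print(audio_span_ms(33))  # 33 MFCC frames -> 340.0 ms under these assumptions
    print(video_span_ms(3))   # 3 visual frames -> 100.0 ms at 30 fps
    # Five dummy frames yield three windows of three frames each.
    print(sliding_windows([0, 1, 2, 3, 4], 3))
```

Under these assumptions, each input window of 33 audio frames would be paired with an output window of 3 visual frames; how the two windows are aligned in time is not specified in the quoted text.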

“…This uses hand corrected phoneme annotations and so represents an ideal system. Second is a speaker-dependent system based on [22] that maps directly from acoustic features to AAM features using the ground-truth data (Audio-to-AAM). This is trained on the single speaker dataset and, although not appropriate for later speaker-independent operation, it serves as a useful baseline to evaluate the error introduced by mapping directly from audio compared with from phonemes.…”

Section: A Single-speaker Analysis (mentioning)

confidence: 99%

“…However, large vocabulary, robust speech recognisers require broad contextual windows that span multiple words in order to exploit language models which improve decoding accuracy [20]. This introduces a significant time lag that makes real-time speech animation prohibitive, where tolerable delays are of the order 200ms [21], [22]. Conversely, audio-driven speech animation has been shown to generate realistic lip synchronisation in real-time, but in a speaker-dependent setting [2], [22].…”

Section: Introduction (mentioning)

confidence: 99%

“…This introduces a significant time lag that makes real-time speech animation prohibitive, where tolerable delays are of the order 200ms [21], [22]. Conversely, audio-driven speech animation has been shown to generate realistic lip synchronisation in real-time, but in a speaker-dependent setting [2], [22]. Our work leverages the advantages of both audio and text driven methods to achieve speaker-independent, real-time speech animation.…”

Section: Introduction (mentioning)

confidence: 99%
