Interspeech 2018
DOI: 10.21437/interspeech.2018-2066

The Effect of Real-Time Constraints on Automatic Speech Animation

Abstract: Machine learning has previously been applied successfully to speech-driven facial animation. To account for carry-over and anticipatory coarticulation, a common approach is to predict the facial pose using a symmetric window of acoustic speech that includes both past and future context. Using future context limits this approach for animating the faces of characters in real-time and networked applications, such as online gaming. An acceptable latency for conversational speech is 200 ms and typically network trans…
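
The abstract's core constraint can be made concrete with a rough calculation: with a typical 10 ms MFCC frame hop (an assumed value, not stated in the abstract), a symmetric window that extends F frames into the future cannot emit a facial pose until roughly F × 10 ms of audio has been buffered, which competes directly with the 200 ms conversational-latency budget. A minimal sketch, under those assumptions:

```python
# Rough sketch (assumed 10 ms frame hop, not stated in the abstract): how much
# of the 200 ms latency budget is consumed just by waiting for future context.
FRAME_HOP_MS = 10
LATENCY_BUDGET_MS = 200

def lookahead_latency_ms(future_frames: int, hop_ms: int = FRAME_HOP_MS) -> int:
    """Buffering delay forced by a window that reaches future_frames into the future."""
    return future_frames * hop_ms

# e.g. a symmetric 33-frame window (16 past, 1 current, 16 future frames):
delay = lookahead_latency_ms(future_frames=16)
print(delay, "ms of the", LATENCY_BUDGET_MS, "ms budget")  # 160 ms before any network delay
```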

Cited by 4 publications (7 citation statements) · References 17 publications

Citation statements:

“…The system that generates visual features directly from an audio waveform follows the deep learning approach in [22], and maps a sequence of input MFCC vectors to an output sequence of AAM vectors. The approach in [22] was developed for a single speaker and uses an audio-visual speech database (KB-2k) for training that provides both acoustic and visual features. Our approach extends this system to become speaker independent.…”
Section: B. Multi-Speaker Audio-to-Visual Speech Mapping
Confidence: 99%
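
As a purely illustrative reading of this description, the mapping can be sketched as a feed-forward regressor from a stacked window of MFCC vectors to a single AAM parameter vector. The layer sizes, MFCC dimensionality, and number of AAM parameters below are assumptions, not details taken from [22]:

```python
# Minimal sketch (assumed architecture, not the model from [22]): a feed-forward
# regressor mapping a stacked window of MFCC vectors to one AAM parameter vector.
import torch
import torch.nn as nn

N_MFCC = 13    # assumed MFCC dimensionality per frame
K_AUDIO = 33   # audio context frames, as reported in the citing text
N_AAM = 30     # assumed number of AAM (active appearance model) parameters

audio_to_visual = nn.Sequential(
    nn.Flatten(),                       # (batch, K_AUDIO, N_MFCC) -> (batch, K_AUDIO * N_MFCC)
    nn.Linear(K_AUDIO * N_MFCC, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, N_AAM),              # one AAM vector per input window
)

mfcc_windows = torch.randn(8, K_AUDIO, N_MFCC)   # a batch of 8 stacked MFCC windows
aam_params = audio_to_visual(mfcc_windows)       # shape: (8, N_AAM)
```
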
“…In terms of context window sizes, a sequence of K_A = 33 MFCC features (340 ms of audio) and K_V = 3 visual features (100 ms of video) gave best performance. Further details of the audio-to-visual speech model architecture can be found in [22].…”
Section: B. Multi-Speaker Audio-to-Visual Speech Mapping
Confidence: 99%
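
For illustration, the audio context described here can be built by stacking K_A = 33 consecutive MFCC frames around each target video frame. The padding strategy, MFCC dimensionality, and exact audio-video alignment below are assumptions; only the window size comes from the quoted text:

```python
# Sketch (assumptions noted in comments): building overlapping 33-frame MFCC
# context windows, one centred on each target frame, with edge padding at the
# utterance boundaries. Alignment details of [22] are not specified here.
import numpy as np

K_A = 33                            # audio context frames reported above (~340 ms)
mfccs = np.random.randn(500, 13)    # toy utterance: 500 frames of 13-dim MFCCs

def context_windows(features: np.ndarray, k: int) -> np.ndarray:
    """Stack k consecutive frames around each centre frame (symmetric window)."""
    half = k // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    return np.stack([padded[t:t + k] for t in range(len(features))])

windows = context_windows(mfccs, K_A)   # shape: (500, 33, 13), one window per frame
```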