Emotion- and speech-driven facial animation is a useful component of many intelligent systems. Given a speech signal, a recognizer outputs a sequence of phoneme and emotion pairs, from which we compute the corresponding sequence of viseme and expression pairs; these are subsequently transformed into a consistent, synchronized facial animation video. This article introduces a novel facial animation technique that generates realistic human face animation videos from emotional speech. More specifically, we first deploy a multi-label feature selector to extract acoustic features that are sufficiently discriminative for phoneme and emotion pairs, and then compute the corresponding sequence of such pairs. Next, we propose a low-rank active learning paradigm that discovers multiple key facial frames best representing these phoneme and emotion pairs in the feature subspace; theoretically, the designed active learner is highly tolerant to video frame noise. Subsequently, we associate each phoneme and emotion pair with a key facial frame, and the well-known morphing technique fits the associated key frames into a smooth animated facial video by generating multiple transitional frames between each pair of temporally adjacent key frames. Experiments demonstrate that the synthesized facial videos look realistic and smooth, and remain synchronized with a variety of male and female speeches.
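The pipeline summarized above can be sketched as three stages: recognizing phoneme and emotion pairs from speech, selecting one representative key facial frame per pair, and morphing between adjacent key frames. The following is a minimal illustrative sketch, not the authors' implementation: all function names are hypothetical placeholders, the recognizer and key-frame selector are stubs standing in for the multi-label feature selection and low-rank active learning steps, and facial frames are reduced to scalars with linear interpolation standing in for morphing.

```python
# Hypothetical pipeline sketch; names and data layout are illustrative only.

def recognize_pairs(speech_frames):
    """Stub recognizer: map each acoustic frame to a (phoneme, emotion) pair.

    Stands in for the multi-label feature selection and recognition stage.
    """
    return [(f["phoneme"], f["emotion"]) for f in speech_frames]

def select_key_frames(pairs, frame_bank):
    """Stub selector: pick one representative facial frame per pair.

    Stands in for the low-rank active learning key-frame discovery;
    frame_bank is an assumed lookup from pair to facial frame.
    """
    return [frame_bank[p] for p in pairs]

def morph(key_frames, steps=3):
    """Linear interpolation between adjacent key frames, standing in
    for the morphing step that generates transitional frames."""
    video = []
    for a, b in zip(key_frames, key_frames[1:]):
        for t in range(steps):
            alpha = t / steps
            video.append((1 - alpha) * a + alpha * b)
    video.append(key_frames[-1])
    return video

# Toy input: two acoustic frames and a two-entry frame bank (scalars
# stand in for facial images).
speech_frames = [
    {"phoneme": "aa", "emotion": "happy"},
    {"phoneme": "b", "emotion": "happy"},
]
frame_bank = {("aa", "happy"): 0.0, ("b", "happy"): 1.0}

pairs = recognize_pairs(speech_frames)
video = morph(select_key_frames(pairs, frame_bank))
```

With two key frames and `steps=3`, the sketch produces four output frames: the first key frame, two transitional frames, and the final key frame.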