2015 IEEE/SICE International Symposium on System Integration (SII)
DOI: 10.1109/sii.2015.7404961

Talking heads synthesis from audio with deep neural networks

Abstract: Talking heads synthesis with expressions from speech is proposed in this paper. Talking heads synthesis can be considered a learning problem of sequence-to-sequence mapping, with audio as input and video as output. To synthesize talking heads, we use the SAVEE database, which consists of videos of multiple spoken sentences recorded from the front of the face. Audiovisual data can be considered two parallel sequences of audio and visual features, composed of continuous values. Thus, audio …
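Read as a sequence-to-sequence regression problem, the abstract's framing can be sketched as below: a recurrent network regresses a visual-feature sequence from an aligned audio-feature sequence. This is a minimal illustration only; the feature dimensions, the bidirectional LSTM, and all names are assumptions, not the paper's reported architecture.

```python
# Minimal sketch: regress a visual feature sequence from an audio feature
# sequence. Dimensions and architecture are illustrative assumptions.
import torch
import torch.nn as nn

class AudioToVisual(nn.Module):
    def __init__(self, audio_dim=39, visual_dim=30, hidden=128):
        super().__init__()
        # Bidirectional LSTM reads the whole audio feature sequence.
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True,
                           bidirectional=True)
        # Linear readout regresses one visual feature vector per audio frame.
        self.out = nn.Linear(2 * hidden, visual_dim)

    def forward(self, audio):          # audio: (batch, frames, audio_dim)
        h, _ = self.rnn(audio)
        return self.out(h)             # (batch, frames, visual_dim)

model = AudioToVisual()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Dummy batch standing in for aligned audio/visual feature sequences
# (e.g. MFCC frames paired with face-shape parameters from SAVEE).
audio = torch.randn(4, 100, 39)
visual = torch.randn(4, 100, 30)

optim.zero_grad()
loss = loss_fn(model(audio), visual)
loss.backward()
optim.step()
```

At synthesis time, the predicted visual-feature sequence would still have to be rendered back into video frames, a step this sketch omits.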

Cited by 23 publications (23 citation statements) | References 6 publications
“…Generating photo-realistic video portraits in line with any input audio stream has long been a popular research topic in computer graphics and vision [16,17,6]. Some methods aim at finding out the exact correspondence between audio and frames [39,11,56,27,59,15,34,38,40].…”
Section: Related Work
confidence: 99%
“…Visual speech synthesis refers to the process of performing mouth animation according to a speech signal. This can be accomplished, for example, by performing a direct mapping from audio to animation parameters using a regression function [Cudeiro et al 2019; Hou et al 2016; Hussen Abdelaziz et al 2019; Karras et al 2017; Shimba et al 2015; Suwajanakorn et al 2017; Thies et al 2020; Zhou et al 2018]. For example [Shimba et al 2015] formulate the task as a sequence-to-sequence mapping with audio as input and video as output.…”
Section: Visual Speech Synthesis
confidence: 99%
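The direct regression idea named in the statement above can be sketched per frame: each animation-parameter vector is predicted from a context window of surrounding audio frames. The window width, dimensions, and closed-form ridge solver here are assumptions for illustration, not any cited method's setup.

```python
# Minimal sketch: per-frame ridge regression from windowed audio features
# to animation parameters. All sizes are illustrative assumptions.
import numpy as np

def stack_context(audio, width=5):
    """Concatenate +/- `width` neighbouring audio frames onto each frame."""
    T, _ = audio.shape
    padded = np.pad(audio, ((width, width), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * width + 1)])

def fit_ridge(X, Y, lam=1e-2):
    """Closed-form ridge regression weights W mapping X -> Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Dummy data: 200 frames of 13-dim audio features, 20-dim animation params.
audio = np.random.randn(200, 13)
params = np.random.randn(200, 20)

X = stack_context(audio)          # (200, 13 * 11)
W = fit_ridge(X, params)
pred = X @ W                      # predicted animation parameters per frame
```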
“…A similar approach is proposed by [Zhou et al 2018]. While they rely on similar audio features as [Shimba et al 2015], they use an animator centred face rig to represent facial expressions. This allows creating sequences of visual speech that can be edited and fine-tuned by a human animator.…”
Section: Visual Speech Synthesis
confidence: 99%
“…via phonemic transcription. For direct approaches, the conversion function typically involves some form of regression [16,24,26,27] or indexing a codebook of visual features using the corresponding features extracted from the acoustic speech [3,13]. For indirect approaches, the mapping function involves concatenation or interpolation of pre-existing data [5,7,9,21,29] or using a generative model [2,10,17].…”
Section: Related Work
confidence: 99%
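The codebook variant mentioned in the statement above can be sketched as a nearest-neighbour lookup: an acoustic codebook is built from training data, and each code is paired with the mean of its aligned visual features. The k-means construction, codebook size, and dimensions are illustrative assumptions, not a reconstruction of [3] or [13].

```python
# Minimal sketch: index a codebook of visual features with the nearest
# acoustic centroid. Sizes and construction are illustrative assumptions.
import numpy as np
from scipy.cluster.vq import kmeans2

# Dummy parallel training data: acoustic and visual feature frames.
acoustic = np.random.randn(1000, 13)
visual = np.random.randn(1000, 20)

# Build an acoustic codebook; pair each code with the mean visual frame
# of the training frames assigned to it (zeros if a cluster is empty).
K = 64
codebook, labels = kmeans2(acoustic, K, minit="++", seed=0)
visual_codes = np.stack([
    visual[labels == k].mean(axis=0) if np.any(labels == k)
    else np.zeros(visual.shape[1])
    for k in range(K)
])

def synthesize(frames):
    """Map each acoustic frame to the visual code of its nearest centroid."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return visual_codes[d.argmin(axis=1)]

pred = synthesize(np.random.randn(50, 13))   # (50, 20) visual features
```

Indexing trades the smoothness of a learned regression for simplicity: output frames are restricted to the codebook, so consecutive frames typically need additional smoothing.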