2015 IEEE/SICE International Symposium on System Integration (SII)
DOI: 10.1109/sii.2015.7404961

Talking heads synthesis from audio with deep neural networks

Abstract: Talking heads synthesis with expressions from speech is proposed in this paper. Talking heads synthesis can be considered a learning problem of sequence-to-sequence mapping, with audio as input and video as output. To synthesize talking heads, we use the SAVEE database, which consists of videos of multiple spoken sentences recorded from the front of the face. Audiovisual data can be considered two parallel sequences of audio and visual features, composed of continuous values. Thus, audio …
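Read as a sequence-to-sequence regression problem, the abstract's framing can be sketched as below: a recurrent network regresses a visual-feature sequence from an aligned audio-feature sequence. This is a minimal illustration only; the feature dimensions, the bidirectional LSTM, and all names are assumptions, not the paper's reported architecture.

```python
# Minimal sketch: regress a visual feature sequence from an audio feature
# sequence. Dimensions and architecture are illustrative assumptions.
import torch
import torch.nn as nn

class AudioToVisual(nn.Module):
    def __init__(self, audio_dim=39, visual_dim=30, hidden=128):
        super().__init__()
        # Bidirectional LSTM reads the whole audio feature sequence.
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True,
                           bidirectional=True)
        # Linear readout regresses one visual feature vector per audio frame.
        self.out = nn.Linear(2 * hidden, visual_dim)

    def forward(self, audio):          # audio: (batch, frames, audio_dim)
        h, _ = self.rnn(audio)
        return self.out(h)             # (batch, frames, visual_dim)

model = AudioToVisual()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Dummy batch standing in for aligned audio/visual feature sequences
# (e.g. MFCC frames paired with face-shape parameters from SAVEE).
audio = torch.randn(4, 100, 39)
visual = torch.randn(4, 100, 30)

optim.zero_grad()
loss = loss_fn(model(audio), visual)
loss.backward()
optim.step()
```

At synthesis time, the predicted visual-feature sequence would still have to be rendered back into video frames, a step this sketch omits.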

Cited by 23 publications (23 citation statements) | References 6 publications
“…Generating photo-realistic video portraits in line with any input audio stream has long been a popular research topic in computer graphics and vision [16,17,6]. Some methods aim at finding out the exact correspondence between audio and frames [39,11,56,27,59,15,34,38,40].…”
Section: Related Work
confidence: 99%
“…Visual speech synthesis refers to the process of performing mouth animation according to a speech signal. This can be accomplished, for example, by performing a direct mapping from audio to animation parameters using a regression function [Cudeiro et al 2019; Hou et al 2016; Hussen Abdelaziz et al 2019; Karras et al 2017; Shimba et al 2015; Suwajanakorn et al 2017; Thies et al 2020; Zhou et al 2018]. For example [Shimba et al 2015] formulate the task as a sequence-to-sequence mapping with audio as input and video as output.…”
Section: Visual Speech Synthesis
confidence: 99%
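The direct regression idea named in the statement above can be sketched per frame: each animation-parameter vector is predicted from a context window of surrounding audio frames. The window width, dimensions, and closed-form ridge solver here are assumptions for illustration, not any cited method's setup.

```python
# Minimal sketch: per-frame ridge regression from windowed audio features
# to animation parameters. All sizes are illustrative assumptions.
import numpy as np

def stack_context(audio, width=5):
    """Concatenate +/- `width` neighbouring audio frames onto each frame."""
    T, _ = audio.shape
    padded = np.pad(audio, ((width, width), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * width + 1)])

def fit_ridge(X, Y, lam=1e-2):
    """Closed-form ridge regression weights W mapping X -> Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Dummy data: 200 frames of 13-dim audio features, 20-dim animation params.
audio = np.random.randn(200, 13)
params = np.random.randn(200, 20)

X = stack_context(audio)          # (200, 13 * 11)
W = fit_ridge(X, params)
pred = X @ W                      # predicted animation parameters per frame
```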
“…A similar approach is proposed by [Zhou et al 2018]. While they rely on similar audio features as [Shimba et al 2015], they use an animator centred face rig to represent facial expressions. This allows creating sequences of visual speech that can be edited and fine-tuned by a human animator.…”
Section: Visual Speech Synthesis
confidence: 99%
“…via phonemic transcription. For direct approaches, the conversion function typically involves some form of regression [16,24,26,27] or indexing a codebook of visual features using the corresponding features extracted from the acoustic speech [3,13]. For indirect approaches, the mapping function involves concatenation or interpolation of pre-existing data [5,7,9,21,29] or using a generative model [2,10,17].…”
Section: Related Work
confidence: 99%
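The codebook variant mentioned in the statement above can be sketched as a nearest-neighbour lookup: an acoustic codebook is built from training data, and each code is paired with the mean of its aligned visual features. The k-means construction, codebook size, and dimensions are illustrative assumptions, not a reconstruction of [3] or [13].

```python
# Minimal sketch: index a codebook of visual features with the nearest
# acoustic centroid. Sizes and construction are illustrative assumptions.
import numpy as np
from scipy.cluster.vq import kmeans2

# Dummy parallel training data: acoustic and visual feature frames.
acoustic = np.random.randn(1000, 13)
visual = np.random.randn(1000, 20)

# Build an acoustic codebook; pair each code with the mean visual frame
# of the training frames assigned to it (zeros if a cluster is empty).
K = 64
codebook, labels = kmeans2(acoustic, K, minit="++", seed=0)
visual_codes = np.stack([
    visual[labels == k].mean(axis=0) if np.any(labels == k)
    else np.zeros(visual.shape[1])
    for k in range(K)
])

def synthesize(frames):
    """Map each acoustic frame to the visual code of its nearest centroid."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return visual_codes[d.argmin(axis=1)]

pred = synthesize(np.random.randn(50, 13))   # (50, 20) visual features
```

Indexing trades the smoothness of a learned regression for simplicity: output frames are restricted to the codebook, so consecutive frames typically need additional smoothing.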