2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019
DOI: 10.1109/cvpr.2019.00772

Speech2Face: Learning the Face Behind a Voice

Abstract: How much can we infer about a person's looks from the way they speak? In this paper, we study the task of reconstructing a facial image of a person from a short audio recording of that person speaking. We design and train a deep neural network to perform this task using millions of natural Internet/YouTube videos of people speaking. During training, our model learns voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity…

Cited by 147 publications (119 citation statements)
References 35 publications (43 reference statements)
“…Hao et al [12] proposed a uniform framework using cycle constraint for the visual-audio mutual generation. Recently, Oh et al [8] presented a model for generating the face images from a voice using a pretrained face decoder, but the generated results are not sharp due to the concern of privacy. Duarte et al [14] generated sharp face images conditioned on the input speech segmentation using GANs.…”
Section: Audio-to-image Generation (mentioning)
confidence: 99%
“…Following this work, several text-to-image models [3], [4], [10] were proposed for the text-to-image task, based on the same teacher-student learning method to train the text encoder. Recently, teacher-student learning was used to generate the face behind a voice [13]. As a comparison, traditional audio-to-image generation models [11] used a classifier as the feature extractor.…”
Section: E Teacher-student Learning (mentioning)
confidence: 99%