Speech2Face: Learning the Face Behind a Voice

Oh, Tae-Hyun; Dekel, Tali; Kim, Chang-Il; Mosseri, Inbar; Freeman, William T.; Rubinstein, Michael; Matusik, Wojciech

doi:10.1109/cvpr.2019.00772

Cited by 147 publications

(119 citation statements)

References 35 publications

(43 reference statements)

Supporting

Mentioning

108

Contrasting

Unclassified

“…Hao et al [12] proposed a uniform framework using cycle constraint for the visual-audio mutual generation. Recently, Oh et al [8] presented a model for generating the face images from a voice using a pretrained face decoder, but the generated results are not sharp due to the concern of privacy. Duarte et al [14] generated sharp face images conditioned on the input speech segmentation using GANs.…”

Section: Audio-to-image Generation (mentioning)

confidence: 99%

“…Following this work, several text-to-image models [3], [4], [10] were proposed for the text-to-image task, based on the same teacher-student learning method to train the text encoder. Recently, teacher-student learning was used to generate the face behind a voice [13]. As a comparison, traditional audio-to-image generation models [11] used a classifier as the feature extractor.…”

Section: E Teacher-student Learning (mentioning)

confidence: 99%

“…Although the class label of the speech description is accessible, there might be generalization problem when the trained model is tested on the new unseen data (new class respect to the training set). Inspired by the cross-modal generation models [3], [7], [13], [16], [23], in this work, we use teacher-student learning [30] to overcome this problem to some extent.…”

Section: B Teacher-student Learning (mentioning)

confidence: 99%

“…Text-to-image translation [3]-[5], [7] is a closely related topic to ours, which has been investigated for several years. In some text-to-image models, zero-shot learning based methods [7], [8] and generative adversarial networks (GANs) [9] have been used to extract features and synthesize realistic images, respectively. These models generalize better on the new testing classes by leveraging the teacher-student learning to train the text encoder [7], [10].…”

Section: Introduction (mentioning)

confidence: 99%

“…In addition to the text-to-image translation, several models for audio-to-image generation were also presented in recent years. Chen et al [11] and Hao et al [12] synthesized instrument images from different music inputs; Oh et al [13] and Amanda et al [14] reconstructed the human face images from input speech based on the positive correlations between a person's appearance and his voice, and both their frameworks contain a speech encoder and a face decoder. Different from these audio-to-image generation works, which model the acoustic or phonetic information mainly, our speech-to-image translation aims to model the linguistic information in the input speech and translate it into the images.…”

Section: Introduction (mentioning)

confidence: 99%

Direct Speech-to-Image Translation

Zhang, Jia, et al. 2020

IEEE J. Sel. Top. Signal Process.

Direct speech-to-image translation without text is an interesting and useful topic due to the potential applications in human-computer interaction, art creation, computer-aided design, etc. Not to mention that many languages have no written form. However, as far as we know, it has not been well-studied how to translate the speech signals into images directly and how well they can be translated. In this paper, we attempt to translate the speech signals into the image signals without the transcription stage. Specifically, a speech encoder is designed to represent the input speech signals as an embedding feature, and it is trained with a pretrained image encoder using teacher-student learning to obtain better generalization ability on new classes. Subsequently, a stacked generative adversarial network is used to synthesize high-quality images conditioned on the embedding feature. Experimental results on both synthesized and real data show that our proposed method is effective in translating the raw speech signals into images without the intermediate text representation. An ablation study gives more insight into our method.

Visual mechanisms for voice‐identity recognition flexibly adjust to auditory noise level

Maguinness, Kriegstein 2021

Human Brain Mapping

Recognising the identity of voices is a key ingredient of communication. Visual mechanisms support this ability: recognition is better for voices previously learned with their corresponding face (compared to a control condition). This so-called 'face-benefit' is supported by the fusiform face area (FFA), a region sensitive to facial form and identity. Behavioural findings indicate that the face-benefit increases in noisy listening conditions. The neural mechanisms for this increase are unknown. Here, using functional magnetic resonance imaging, we examined responses in face-sensitive regions while participants recognised the identity of auditory-only speakers (previously learned by face) in high (SNR −4 dB) and low (SNR +4 dB) levels of auditory noise. We observed a face-benefit in both noise levels, for most participants (16 of 21). In high-noise, the recognition of face-learned speakers engaged the right posterior superior temporal sulcus motion-sensitive face area (pSTS-mFA), a region implicated in the processing of dynamic facial cues. The face-benefit in high-noise also correlated positively with increased functional connectivity between this region and voice-sensitive regions in the temporal lobe in the group of 16 participants with a behavioural face-benefit. In low-noise, the face-benefit was robustly associated with increased responses in the FFA and to a lesser extent the right pSTS-mFA. The findings highlight the remarkably adaptive nature of the visual network supporting voice-identity recognition in auditory-only listening conditions.