Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) with raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g reference image or one-hot encoding). Our model is trained in a self-supervised approach by exploiting the audio and visual signals naturally aligned in videos. With the purpose of training from video data, we present a novel dataset collected for this work, with highquality videos of youtubers with notable expressiveness in both the speech and visual signals.
In this work, we present a novel learning based approach to reconstruct 3D faces from a single or multiple images. Our method uses a simple yet powerful architecture based on siamese neural networks that helps to extract relevant features from each view while keeping the models small. Instead of minimizing multiple objectives, we propose to simultaneously learn the 3D shape and the individual camera poses by using a single term loss based on the reprojection error, which generalizes from one to multiple views. This allows to globally optimize the whole scene without having to tune any hyperparameters and to achieve low reprojection errors, which are important for further texture generation. Finally, we train our model on a large scale dataset with more than 6,000 facial scans. We report competitive results in 3DFAW 2019 challenge, showing the effectiveness of our method.
io/h3d-net Figure 1. We introduce H3D-Net, a method for high-fidelity 3D head reconstruction in the wild. Our method estimates a signed distance function (SDF) of the head by optimizing a coordinate-based neural network on a small set of input images. This optimization process is constrained by a pre-trained probabilistic model of 3D head SDFs to obtain plausible shapes in few-shot setups. The figure shows the 3D head reconstruction of three scenes obtained with the proposed method from only three images with associated masks and camera poses.
Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) with raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g reference image or one-hot encoding). Our model is trained in a self-supervised approach by exploiting the audio and visual signals naturally aligned in videos. With the purpose of training from video data, we present a novel dataset collected for this work, with highquality videos of youtubers with notable expressiveness in both the speech and visual signals.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.