2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01381
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

Abstract: Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip sequences to speech mappings for individual speakers in …

Cited by 84 publications (126 citation statements)
References 21 publications
“…In the following sections, we show quantitative comparisons based on the Short-Time Objective Intelligibility (STOI) [17], Extended Short-Time Objective Intelligibility (ESTOI) [18], and Perceptual Evaluation of Speech Quality (PESQ) [19] metrics. We also make a qualitative comparison between samples from our model and the current state-of-the-art Lip2Wav model [11]. As mentioned before, speech prediction is a many-to-many mapping problem.…”
Section: Results
confidence: 99%
“…Each sentence contains six words chosen from a fixed dictionary. Similar to prior works [7,9,11], we used four speakers (S1, S2, S4, and S29) from the dataset for comparison.…”
Section: Results
confidence: 99%
“…In [2], researchers have stated that viewing a speaker's face significantly enhances a person's capacity to understand speech in a noisy environment. The use of the visual modality has also proved fruitful in different speech processing algorithms, such as audio-visual speech recognition [3], lip reading [4,5], and lip to speech synthesis [6]. Recent studies also demonstrated that the use of visual features can help in speech denoising in very low signal-to-noise ratio (SNR) conditions [7,8].…”
Section: Introduction
confidence: 99%