2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01381
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

Abstract: Humans involuntarily tend to infer parts of the conversation from lip movements when the speech is absent or corrupted by external noise. In this work, we explore the task of lip to speech synthesis, i.e., learning to generate natural speech given only the lip movements of a speaker. Acknowledging the importance of contextual and speaker-specific cues for accurate lip-reading, we take a different path from existing works. We focus on learning accurate lip sequences to speech mappings for individual speakers in …

Cited by 84 publications (126 citation statements)
References 21 publications
“…In the following sections, we show quantitative comparisons based on the Short-Time Objective Intelligibility (STOI) [17], Extended Short-Time Objective Intelligibility (ESTOI) [18], and Perceptual Evaluation of Speech Quality (PESQ) [19] metrics. We also make a qualitative comparison between samples from our model and the current state-of-the-art Lip2Wav model [11]. As mentioned before, speech prediction is a many-to-many mapping problem.…”
Section: Results
confidence: 99%
“…Each sentence contains six words chosen from a fixed dictionary. Similar to prior works [7,9,11], we used four speakers (S1, S2, S4, and S29) from the dataset for comparison.…”
Section: Results
confidence: 99%
“…In [2], researchers have stated that viewing a speaker's face significantly enhances a person's capacity to understand speech in a noisy environment. The use of the visual modality has also proved fruitful in different speech processing algorithms, such as audio-visual speech recognition [3], lip reading [4,5], and lip to speech synthesis [6]. Recent studies also demonstrated that the use of visual features can help in speech denoising in very low signal-to-noise ratio (SNR) conditions [7,8].…”
Section: Introduction
confidence: 99%