Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

Hegde, Sindhu B; Prajwal, K R; Mukhopadhyay, Rudrabha; Namboodiri, Vinay P.; Jawahar, C. V.

doi:10.1145/3503161.3548081

Cited by 4 publications

(6 citation statements)

References 26 publications


“…Lip2Wav worked reasonably well on speakers seen during training but did not extend to handling unseen speakers. Several more recent studies [17,20,21,23] have attempted to develop end-to-end lip-to-speech synthesis models on large datasets containing data from hundreds of speakers. For instance, in [17], authors proposed a variational approach that matches the distributions of lip movements and speech segments to project them into a shared space, which allows for handling the high variations of in-the-wild speakers to some extent.…”

Section: Related Work (mentioning)

confidence: 99%

“…Several more recent studies [17,20,21,23] have attempted to develop end-to-end lip-to-speech synthesis models on large datasets containing data from hundreds of speakers. For instance, in [17], authors proposed a variational approach that matches the distributions of lip movements and speech segments to project them into a shared space, which allows for handling the high variations of in-the-wild speakers to some extent. Meanwhile, both [20,23] utilized a transformer-based approach to convert lip-to-speech synthesis into a sequence-to-sequence problem, where a sequence of lip movements is translated into a sequence of speech tokens.…”

Section: Related Work (mentioning)

confidence: 99%

“…Comparisons: To evaluate lip-to-speech methods on the constrained single-speaker TCD-TIMIT dataset, we compare four existing approaches: (i) GAN-based [36], (ii) Lip2Wav [27], (iii) VAE-GAN [17], and (iv) VCA-GAN [20]. We adopt the same settings as Lip2Wav [27] and report the scores [27] and VCA-GAN [20].…”

Section: Speech Synthesis In Constrained Settings (mentioning)

confidence: 99%

“…Comparisons: In order to assess the performance of lip-to-speech methods in unconstrained scenarios, we employ three datasets: word-level LRW [8], sentence-level LRS2 [7], and LRS3 [3]. While the authors of VAE-GAN [17] have re-trained the GAN-based [36] and Lip2Wav [27] models in a multi-speaker context, we present the scores from their original study for comparison. For VCA-GAN [20], SVTS [23], and Multi-task Lip-to-Speech synthesis [21], we adopt the speech metric (PESQ, STOI and ESTOI) scores from [21].…”

Section: Speech Synthesis In Unconstrained Settings (mentioning)

confidence: 99%

“…While more recent efforts like [27] have extended lip-to-speech to "in-the-wild" environments, most of these models are speaker-specific in nature, i.e., they only work on speakers they are trained on. The speaker-independent models [17,20,21,23] suffer from numerous weaknesses and fail to accurately learn language and speech attributes like voice, prosody, etc. Due to the sheer difficulty of the task, they turn out to be well below expectations for an end-user application.…”

Section: Introduction (mentioning)

confidence: 99%


Towards Accurate Lip-to-Speech Synthesis in-the-Wild

Hegde, Mukhopadhyay, Jawahar et al. 2023

Proceedings of the 31st ACM International Conference on Multimedia

Self Cite


Figure 1: We propose a novel approach for multi-speaker lip-to-speech synthesis in the wild. Prior works try to learn a language model directly from raw speech, which only provides weak supervision due to the presence of other acoustic variations such as voice, accents, and prosody. We solve this problem by relying on recent advancements in lip-to-text generation. We condition on the noisy text outputs and lip video to generate natural speech with clearly pronounced words.

