Proceedings of the 30th ACM International Conference on Multimedia 2022
DOI: 10.1145/3503161.3548081
Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

Abstract: We address the problem of generating speech from silent lip videos for any speaker in the wild. Previous works train either on large amounts of data of isolated speakers or in laboratory settings with a limited vocabulary. In contrast, our method generates speech for the lip movements of arbitrary identities in any voice, without additional speaker-specific fine-tuning. Our new VAE-GAN approach allows us to learn strong audio-visual associations despite the ambiguous nature of the task.

Cited by 4 publications (6 citation statements); references 26 publications.
“…Lip2Wav worked reasonably well on speakers seen during training but did not extend to handling unseen speakers. Several more recent studies [17,20,21,23] have attempted to develop end-to-end lip-to-speech synthesis models on large datasets containing data from hundreds of speakers. For instance, in [17], the authors proposed a variational approach that matches the distributions of lip movements and speech segments to project them into a shared space, which allows for handling the high variation of in-the-wild speakers to some extent.…”
Section: Related Work
confidence: 99%
“…Several more recent studies [17,20,21,23] have attempted to develop end-to-end lip-to-speech synthesis models on large datasets containing data from hundreds of speakers. For instance, in [17], the authors proposed a variational approach that matches the distributions of lip movements and speech segments to project them into a shared space, which allows for handling the high variation of in-the-wild speakers to some extent. Meanwhile, both [20,23] utilized a transformer-based approach to convert lip-to-speech synthesis into a sequence-to-sequence problem, where a sequence of lip movements is translated into a sequence of speech tokens.…”
Section: Related Work
confidence: 99%