“…Several more recent studies [17,20,21,23] have attempted to develop end-to-end lip-to-speech synthesis models on large datasets containing data from hundreds of speakers. For instance, the authors of [17] proposed a variational approach that matches the distributions of lip movements and speech segments to project them into a shared space, which makes it possible to handle, to some extent, the high variability among in-the-wild speakers. Meanwhile, [20] and [23] both used a transformer-based approach to cast lip-to-speech synthesis as a sequence-to-sequence problem, in which a sequence of lip movements is translated into a sequence of speech tokens.…”