“…They either leverage an external text-to-acoustic alignment module (Ren et al, 2019; 2021a; Peng et al, 2020; Elias et al, 2021) or jointly train one within the TTS model (Zeng et al, 2020; Miao et al, 2021; Badlani et al, 2021). Other generative models have also been studied for TTS, such as Flow-based models (Miao et al, 2020), variational autoencoder (VAE)-based models (Lee et al, 2021; Liu et al, 2021b), and generative adversarial network (GAN)-based models (Donahue et al, 2021; Yang et al, 2021). TTS models that combine different generative modeling techniques have also been investigated, such as Flow with VAE (Ren et al, 2021b) and Flow with VAE and GAN (Kim et al, 2021).…”