“…Modern TTS models factorize speech effectively [42,82,89], allowing users to change text, prosody, and speaker identity independently (i.e., what is said, how it is said, and by whom). Harnessing these rich latent features [72] not only facilitates the crafting of new voice personae [33,76] but also ensures that the synthesized voices capture the nuances and diversity inherent to human speech.…”