Avocodo: Generative Adversarial Network for Artifact-Free Vocoder

Bak, Taejun; Lee, Junmo; Bae, Hanbin; Yang, Jinhyeok; Bae, Jong‐Sup; Joo, Yung Hyup

doi:10.1609/aaai.v37i11.26479

Cited by 8 publications

(1 citation statement)

References 28 publications


“…Mehta et al [29], for example, propose the use of Neural Hidden Markov Models (HMMs) with normalising flows as an acoustic model. In the context of generating speech signal from mel spectrograms, vocoders based on Generative Adversarial Networks (GANs) [8, 30, 31, 32] have gained popularity due to their efficient inference speed, lightweight networks, and ability to produce high-quality waveforms. Furthermore, end-to-end models like VITS [33] or YourTTS [34] have been developed, enabling direct generation of audio signals from linguistic input without the need of an additional vocoder model.…”

Section: Related Work (mentioning)

confidence: 99%

Enhancing Voice Cloning Quality through Data Selection and Alignment-based Metrics

González-Docasal; Álvarez

2023

Preprint


Voice cloning, an emerging field in the speech processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigate the impact of various techniques on improving the quality of voice cloning, specifically focusing on a low-quality dataset. To contrast our findings, we also use two high-quality corpora for comparative analysis. We conduct exhaustive evaluations of the quality of the gathered corpora in order to select the most suitable audios for the training of a Voice Cloning system. Following these measurements, we conduct a series of ablations by removing audios with lower SNR and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduce a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 Text-to-Speech (TTS) system. This algorithm provides a valuable metric for evaluating the alignment quality during the voice cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increase the quality of synthesised audios for the challenging low-quality corpus. Notably, our findings indicate that models trained on a 3-hour corpus from a pre-trained model exhibit comparable audio quality to models trained from scratch using significantly larger amounts of data.
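The abstract does not spell out how the fraction of aligned input characters is computed, so the following is only a minimal sketch under a stated assumption: an input character counts as "aligned" if it receives the highest attention weight at one or more decoder steps. The function name `aligned_fraction`, the matrix layout (`[decoder_step][input_character]`), and the argmax criterion are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def aligned_fraction(attention: Sequence[Sequence[float]]) -> float:
    """Fraction of input characters that win the attention argmax at least once.

    `attention` is indexed as [decoder_step][input_character].
    The argmax criterion is an illustrative assumption, not the paper's definition.
    """
    if not attention or not attention[0]:
        raise ValueError("attention matrix must be non-empty")
    n_chars = len(attention[0])
    attended = set()
    for row in attention:
        if len(row) != n_chars:
            raise ValueError("all rows must have the same length")
        # Index of the character with the highest weight at this decoder step.
        attended.add(max(range(n_chars), key=row.__getitem__))
    return len(attended) / n_chars


# Toy matrix: 4 decoder steps over 3 input characters.
# The last character never receives the highest weight.
toy = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.0],
    [0.2, 0.8, 0.0],
    [0.1, 0.9, 0.0],
]
print(aligned_fraction(toy))
```

In the toy matrix, two of the three characters are attended, so a low value of this kind would flag an utterance whose text was only partly covered by the synthesised audio.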
