RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

Gabdrakhmanov, Lenar; Garaev, Rustem; Razinkov, Evgenii

doi:10.1007/978-3-030-26061-3_12

Cited by 5 publications

(4 citation statements)

References 16 publications

“…These corpora prioritise the measurement of dataset quality across various dimensions. Regarding signal quality, the Signal-to-Noise Ratio (SNR) holds significant importance, both during content filtering [36,37] and data recording stages [38–41]. Linguistic considerations also come into play, with some researchers emphasising the need for balanced phonemic or supraphonemic units within the dataset [38,39,41,42].…”

Section: Related Work (mentioning)

confidence: 99%

“…Additionally, text preprocessing techniques are employed to ensure accurate alignment with the uttered speech and reduce variability in pronunciations [36,39–44]. Lastly, the quantity of audio data generated by each speaker is a critical aspect in corpus creation, particularly in datasets with a low number of speakers [36,38–44].…”

Section: Related Work (mentioning)

confidence: 99%

Enhancing Voice Cloning Quality through Data Selection and Alignment-based Metrics

González-Docasal; Álvarez

2023

Preprint

Voice cloning, an emerging field in the speech processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigate the impact of various techniques on improving the quality of voice cloning, specifically focusing on a low-quality dataset. To contrast our findings, we also use two high-quality corpora for comparative analysis. We conduct exhaustive evaluations of the quality of the gathered corpora in order to select the most suitable audios for the training of a Voice Cloning system. Following these measurements, we conduct a series of ablations by removing audios with lower SNR and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduce a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 Text-to-Speech (TTS) system. This algorithm provides a valuable metric for evaluating the alignment quality during the voice cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increase the quality of synthesised audios for the challenging low-quality corpus. Notably, our findings indicate that models trained on a 3-hour corpus from a pre-trained model exhibit comparable audio quality to models trained from scratch using significantly larger amounts of data.
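The SNR-based ablation mentioned in this abstract can be illustrated with a minimal sketch. It assumes each utterance already comes with estimated signal and noise power; the function names, the dictionary keys, and the 20 dB threshold are hypothetical and are not taken from the paper. Only the decibel definition, SNR = 10·log10(P_signal / P_noise), is standard.

```python
import math


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels from two power estimates."""
    return 10.0 * math.log10(signal_power / noise_power)


def drop_low_snr(utterances, min_snr_db=20.0):
    """Keep only utterances whose SNR reaches the threshold.

    `utterances` is a list of dicts with hypothetical keys
    "signal_power" and "noise_power"; the threshold is illustrative.
    """
    return [
        u for u in utterances
        if snr_db(u["signal_power"], u["noise_power"]) >= min_snr_db
    ]


corpus = [
    {"id": "a", "signal_power": 100.0, "noise_power": 1.0},  # 20 dB
    {"id": "b", "signal_power": 10.0, "noise_power": 1.0},   # 10 dB
]
kept = drop_low_snr(corpus)
```

With these illustrative values, only utterance "a" stays in the corpus; how power is estimated from real audio is left open here.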

“…These corpora prioritise the measurement of dataset quality across various dimensions. Regarding signal quality, the SNR holds significant importance, both during the content filtering [36,37] and data recording stages [38–41]. Linguistic considerations also come into play, with some researchers emphasising the need for balanced phonemic or supraphonemic units within the dataset [38,39,41,42].…”

Section: Related Work (mentioning)

confidence: 99%

“…Additionally, text preprocessing techniques are employed to ensure accurate alignment with the uttered speech and to reduce variability in pronunciations [36,39–44]. Lastly, the quantity of audio data generated by each speaker is a critical aspect in corpus creation, particularly in datasets with a low number of speakers [36,38–44].…”

Section: Related Work (mentioning)

confidence: 99%

Enhancing Voice Cloning Quality through Data Selection and Alignment-Based Metrics

González-Docasal; Álvarez

2023

Applied Sciences

Voice cloning, an emerging field in the speech-processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigated the impact of various techniques on improving the quality of voice cloning, specifically focusing on a low-quality dataset. To contrast our findings, we also used two high-quality corpora for comparative analysis. We conducted exhaustive evaluations of the quality of the gathered corpora in order to select the most-suitable data for the training of a voice-cloning system. Following these measurements, we conducted a series of ablations by removing audio files with a lower signal-to-noise ratio and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduced a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 text-to-speech system. This algorithm provides a valuable metric for evaluating the alignment quality during the voice-cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increased the quality of synthesised audio for the challenging low-quality corpus. Notably, our findings indicated that models trained on a 3 h corpus from a pre-trained model exhibit comparable audio quality to models trained from scratch using significantly larger amounts of data.
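The abstract does not spell out how the fraction of aligned input characters is computed, so the following is only one plausible reading, not the authors' algorithm. It assumes the attention matrix is given as rows of decoder steps over input characters, and it counts a character as aligned when it is the most attended character at some decoder step with a weight of at least `min_weight`. The function name, the threshold, and this definition of "aligned" are all assumptions.

```python
from typing import Sequence


def aligned_char_fraction(
    attention: Sequence[Sequence[float]],
    min_weight: float = 0.5,
) -> float:
    """Fraction of input characters that some decoder step focuses on.

    attention[t][c] is the weight decoder step t places on input
    character c. The argmax-plus-threshold rule is an assumption.
    """
    if not attention or not attention[0]:
        return 0.0
    num_chars = len(attention[0])
    aligned = set()
    for step in attention:
        # Character receiving the most attention at this decoder step.
        best = max(range(num_chars), key=lambda c: step[c])
        if step[best] >= min_weight:
            aligned.add(best)
    return len(aligned) / num_chars


# Toy matrix: 3 decoder steps over 3 input characters.
toy_attention = [
    [0.9, 0.1, 0.0],  # step 0 focuses on character 0
    [0.2, 0.7, 0.1],  # step 1 focuses on character 1
    [0.3, 0.4, 0.3],  # step 2 is diffuse, below the threshold
]
fraction = aligned_char_fraction(toy_attention)
```

In the toy matrix two of the three characters are ever the clear focus of a decoder step, so the score is two thirds. A low score on a real utterance would suggest the model skipped or smeared over part of the text.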
