RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis

Zandie, Rohola; Mahoor, Mohammad H.; Madsen, Julia; Emamian, Eshrat S.

doi:10.21437/interspeech.2021-341

Cited by 11 publications

(8 citation statements)

References 16 publications

“…Additionally, text preprocessing techniques are employed to ensure accurate alignment with the uttered speech and reduce variability in pronunciations [36, 39–44]. Lastly, the quantity of audio data generated by each speaker is a critical aspect in corpus creation, particularly in datasets with a low number of speakers [36, 38–44].…”

Section: Related Work (mentioning)

confidence: 99%

Enhancing Voice Cloning Quality through Data Selection and Alignment-based Metrics

González-Docasal, Álvarez

2023

Preprint

Voice cloning, an emerging field in the speech processing area, aims to generate synthetic utterances that closely resemble the voices of specific individuals. In this study, we investigate the impact of various techniques on improving the quality of voice cloning, specifically focusing on a low-quality dataset. To contrast our findings, we also use two high-quality corpora for comparative analysis. We conduct exhaustive evaluations of the quality of the gathered corpora in order to select the most suitable audios for the training of a Voice Cloning system. Following these measurements, we conduct a series of ablations by removing audios with lower SNR and higher variability in utterance speed from the corpora in order to decrease their heterogeneity. Furthermore, we introduce a novel algorithm that calculates the fraction of aligned input characters by exploiting the attention matrix of the Tacotron 2 Text-to-Speech (TTS) system. This algorithm provides a valuable metric for evaluating the alignment quality during the voice cloning process. We present the results of our experiments, demonstrating that the performed ablations significantly increase the quality of synthesised audios for the challenging low-quality corpus. Notably, our findings indicate that models trained on a 3-hour corpus from a pre-trained model exhibit comparable audio quality to models trained from scratch using significantly larger amounts of data.

“…We used three datasets to train systems. For our base-model we used a scripted conversational corpus, RyanSpeech corpus [22]. This corpus contains 10 hours (11,279 utterances) of a male speaker of US English reading textual materials from conversational settings.…”

Section: Data (mentioning)

confidence: 99%

Prosody-Controllable Spontaneous TTS with Neural HMMs

Lameris, Mehta, Henter, et al.

2023

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech makes the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system's capability of synthesizing two types of creaky voice.

“…Creating a high-quality speech synthesizer demands high-quality single-speaker corpus [29] unlike automatic speech recognition (ASR), which requires a diverse multi-speaker corpus to capture different accents, speaker characteristics, and acoustic environments. The voice talents who record the speech are usually highly trained, fluent, and have experience recording speech.…”

Section: Related Work (mentioning)

confidence: 99%

Building African Voices

Perez, Neubig, Black

2022

Preprint

Modern speech synthesis techniques can produce natural-sounding speech given sufficient high-quality data and compute resources. However, such data is not readily available for many languages. This paper focuses on speech synthesis for low-resourced African languages, from corpus creation to sharing and deploying the Text-to-Speech (TTS) systems. We first create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources and subject-matter expertise. Next, we create new datasets and curate datasets from "found" data (existing recordings) through a participatory approach while considering accessibility, quality, and breadth. We demonstrate that we can develop synthesizers that generate intelligible speech with 25 minutes of created speech, even when recorded in suboptimal environments. Finally, we release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
