Interspeech 2021
DOI: 10.21437/interspeech.2021-341

RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis

Abstract: This paper introduces RyanSpeech, a new speech corpus for research on automated text-to-speech (TTS) systems. Publicly available TTS corpora are often noisy, recorded with multiple speakers, or lack quality male speech data. In order to meet the need for a high quality, publicly available male speech corpus within the field of speech recognition, we have designed and created RyanSpeech which contains textual materials from real-world conversational settings. These materials contain over 10 hours of a professio…

Cited by 11 publications (8 citation statements)
References 16 publications
“…Additionally, text preprocessing techniques are employed to ensure accurate alignment with the uttered speech and reduce variability in pronunciations [36, 39–44]. Lastly, the quantity of audio data generated by each speaker is a critical aspect in corpus creation, particularly in datasets with a low number of speakers [36, 38–44].…”
Section: Related Work (mentioning)
confidence: 99%
“…We used three datasets to train systems. For our base-model we used a scripted conversational corpus, RyanSpeech corpus [22]. This corpus contains 10 hours (11,279 utterances) of a male speaker of US English reading textual materials from conversational settings.…”
Section: Data (mentioning)
confidence: 99%
“…Creating a high-quality speech synthesizer demands high-quality single-speaker corpus [29] unlike automatic speech recognition (ASR), which requires a diverse multi-speaker corpus to capture different accents, speaker characteristics, and acoustic environments. The voice talents who record the speech are usually highly trained, fluent, and have experience recording speech.…”
Section: Related Work (mentioning)
confidence: 99%