The SWARA speech corpus: A large parallel Romanian read speech dataset

Stan, Adriana; Dinescu, Florina Veronica; Ţiple, Cristina; Meza, Serban; Orza, Bogdan; Chirilă, Magdalena; Giurgiu, Mircea

doi:10.1109/sped.2017.7990428

Cited by 22 publications

(8 citation statements)

References 7 publications


“…High-quality speech corpora is essential in neural speech synthesis. In this work we start from the large parallel Romanian dataset called SWARA [20]. SWARA contains 17 volunteer speakers each reading aloud between 1000 and 1500 utterances (the same across all speakers) in a controlled studio environment.…”

Section: Speech Corpus (mentioning)

confidence: 99%

An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis

Lorincz, Stan, Giurgiu. 2021. Preprint.


Multi-speaker spoken datasets enable the creation of text-to-speech synthesis (TTS) systems which can output several voice identities. The multi-speaker (MSPK) scenario also enables the use of fewer training samples per speaker. However, in the resulting acoustic model, not all speakers exhibit the same synthetic quality, and some of the voice identities cannot be used at all. In this paper we evaluate the influence of the recording conditions, speaker gender, and speaker particularities over the quality of the synthesised output of a deep neural TTS architecture, namely Tacotron2. The evaluation is possible due to the use of a large Romanian parallel spoken corpus containing over 81 hours of data. Within this setup, we also evaluate the influence of different types of text representations: orthographic, phonetic, and phonetic extended with syllable boundaries and lexical stress markings. We evaluate the results of the MSPK system using the objective measures of equal error rate (EER) and word error rate (WER), and also look into the distances between natural and synthesised t-SNE projections of the embeddings computed by an accurate speaker verification network. The results show that there is indeed a large correlation between the recording conditions and the speaker's synthetic voice quality. The speaker gender does not influence the output, and extending the input text representation with syllable boundaries and lexical stress information does not equally enhance the generated audio across all speaker identities. The visualisation of the t-SNE projections of the natural and synthesised speaker embeddings shows that the acoustic model shifts some of the speakers' neural representations, but not all of them. As a result, these speakers have lower-quality output speech.
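For readers unfamiliar with the word error rate (WER) named in the abstract, the sketch below shows the standard definition: the word-level edit distance between a reference transcript and a recognised transcript, divided by the number of reference words. It is only an illustration under stated assumptions; the function name `word_error_rate` and the whitespace tokenisation are hypothetical, and the abstract does not say how the authors computed WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenisation is plain whitespace splitting, which is
    an assumption and not taken from the cited paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

In a TTS evaluation of this kind, the hypothesis would typically be the output of a speech recogniser run on the synthesised audio, and the reference would be the input text.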


“…The training data for our systems consists of the SWARA Romanian multispeaker parallel corpus [24]. It includes 18 speakers: 10 female and 8 male voices, with the number of utterances per speaker being between 1000 and 1500.…”

Section: A Training Data and Speaker Data Augmentation (mentioning)

confidence: 99%

Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis

Lorincz, Stan, Giurgiu. 2021. Preprint.


Building multispeaker neural network-based text-to-speech synthesis systems commonly relies on the availability of large amounts of high quality recordings from each speaker and conditioning the training process on the speaker's identity or on a learned representation of it. However, when little data is available from each speaker, or the number of speakers is limited, the multispeaker TTS can be hard to train and will result in poor speaker similarity and naturalness. In order to address this issue, we explore two directions: forcing the network to learn a better speaker identity representation by appending an additional loss term; and augmenting the input data pertaining to each speaker using waveform manipulation methods. We show that both methods are efficient when evaluated with both objective and subjective measures. The additional loss term aids the speaker similarity, while the data augmentation improves the intelligibility of the multispeaker TTS system.


“…The Irish script was generated from the Corpas na Gaeilge Comhaimseartha (Corpus of Contemporary Irish) [17]. The Romanian script was developed using the SWARA Speech Corpus [18]. Not all accents have unique recording scripts.…”

Section: Current Resources 21 Language Resources (mentioning)

confidence: 99%

All Together Now: The Living Audio Dataset

Braude, Aylett, Laoide-Kemp, et al. 2019. Interspeech 2019.


The ongoing focus in speech technology research on machine learning based approaches leaves the community hungry for data. However, datasets tend to be recorded once and then released, sometimes behind registration requirements or paywalls. In this paper we describe our Living Audio Dataset. The aim is to provide audio data that is in the public domain, multilingual, and expandable by communities. We discuss the role of linguistic resources, given the success of systems such as Tacotron which use direct text-to-speech mappings, and consider how data provenance could be built into such resources. So far the data has been collected for TTS purposes, however, it is also suitable for ASR. At the time of publication audio resources already exist for Dutch, R.
