Interspeech 2019
DOI: 10.21437/interspeech.2019-2441

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

Abstract: This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues that make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at a 24 kHz sampling rate fro…

Cited by 423 publications (198 citation statements)
References 27 publications
“…Some utterances start or end in the middle of a sentence, leading to unnatural pronunciation at the beginning and end of utterances. These problems were also addressed in [25]. To remove unnatural pauses, and long pauses in general, we apply the FFmpeg silenceremove filter with a threshold of -40 dB.…”
Section: Data Preprocessing
confidence: 99%
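The threshold-based trimming described in the quoted passage can be sketched in plain Python. This is a minimal illustration of the idea (frame-wise RMS against a dB threshold), not the FFmpeg filter itself; the function name, frame length, and RMS criterion are my own assumptions, and samples are assumed normalized to [-1, 1].

```python
import numpy as np

def trim_silence(samples, sample_rate, threshold_db=-40.0, frame_ms=20):
    """Drop leading and trailing frames whose RMS energy falls below
    threshold_db relative to full scale (samples assumed in [-1, 1])."""
    frame_len = int(sample_rate * frame_ms / 1000)
    threshold = 10.0 ** (threshold_db / 20.0)  # -40 dB -> 0.01 linear
    n_frames = len(samples) // frame_len
    voiced = [
        np.sqrt(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)) >= threshold
        for i in range(n_frames)
    ]
    if not any(voiced):
        return samples[:0]  # entirely silent input
    first = voiced.index(True)
    last = len(voiced) - 1 - voiced[::-1].index(True)
    return samples[first * frame_len:(last + 1) * frame_len]
```

The cited work applies FFmpeg's silenceremove filter directly; an invocation along the lines of `ffmpeg -i in.wav -af "silenceremove=start_periods=1:start_threshold=-40dB:stop_periods=1:stop_threshold=-40dB" out.wav` performs the corresponding edge trimming (exact options depend on the FFmpeg version and desired behavior).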
“…Here we explore the impact of training Conv-TasNet and the deep encoder/decoder on a larger, more diverse training set: LibriTTS [30]. Our goal is to compare the SI-SNRi performance of these two architectures when using the WSJ and LibriTTS datasets for training and the WSJ, LibriTTS, and VCTK [31] datasets for evaluation.…”
Section: Cross-dataset Evaluation
confidence: 99%
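SI-SNRi, the metric compared in the passage above, is the scale-invariant signal-to-noise ratio of the separated estimate minus that of the unprocessed mixture. A minimal sketch of the standard computation (function names and the `eps` stabilizer are my own; signals are zero-meaned as is conventional):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB: project est onto ref, compare target vs. residual energy."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(est, mix, ref):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)
```

Because the estimate is projected onto the reference, the metric is invariant to rescaling the estimate, which is what makes it a fair comparison across models with different output gains.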
“…We train our models using the LJSpeech (LJS) dataset [16], the Sally dataset, a proprietary single-speaker dataset with 20 hours of audio, and a subset of LibriTTS [17]. All datasets used in our experiments consist of read speech.…”
Section: Methods
confidence: 99%