2017
DOI: 10.48550/arXiv.1710.07654
Preprint

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Abstract: We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training an order of magnitude faster. We scale Deep Voice 3 to dataset sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare sever…
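
The abstract names two core ingredients: gated fully-convolutional blocks and dot-product attention. The sketch below is a rough PyTorch illustration of both; the class names, kernel size, and residual scaling are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a gated 1-D convolution block and scaled dot-product
# attention, in the spirit of a fully-convolutional attention-based TTS
# model. Hyperparameters and names here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    """1-D convolution with a gated linear unit and a residual connection."""
    def __init__(self, channels, kernel_size=5, dilation=1):
        super().__init__()
        # Emit 2*channels so one half can gate the other.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=dilation * (kernel_size - 1) // 2,
                              dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        a, b = self.conv(x).chunk(2, dim=1)      # value half, gate half
        return (x + a * torch.sigmoid(b)) * 0.5 ** 0.5  # scaled residual

class DotProductAttention(nn.Module):
    """Score decoder queries against encoder keys; return context vectors."""
    def forward(self, query, keys, values):      # (batch, steps, dim) each
        scores = torch.bmm(query, keys.transpose(1, 2))
        weights = F.softmax(scores / keys.size(-1) ** 0.5, dim=-1)
        return torch.bmm(weights, values), weights
```

Because such a model contains no recurrence, every time step can be computed in parallel during training, which is what makes the order-of-magnitude training speedup claimed in the abstract possible.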

Cited by 97 publications (85 citation statements: 0 supporting, 85 mentioning, 0 contrasting) | References 13 publications | Citing publications span 2019–2022
“…WaveNet can be used as a speech vocoder by conditioning on auxiliary features such as mel-spectrograms and acoustic features extracted by a conventional signal-processing-based vocoder [2]. It is also used in state-of-the-art speech synthesis systems, where it contributes greatly to the quality of synthesized speech [3,4]. However, WaveNet suffers from slow inference because of its autoregressive (AR) mechanism and large network architecture.…”
Section: Introduction (mentioning)
confidence: 99%
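
The slow inference this statement refers to comes from the autoregressive loop: each waveform sample is conditioned on the samples generated before it, so a second of 24 kHz audio takes 24,000 sequential network evaluations. A minimal sketch, assuming a hypothetical `model(context, mel)` that returns logits over 256 quantized amplitude classes:

```python
# Naive sample-by-sample generation with an AR vocoder; `model` is a
# placeholder for any trained autoregressive network.
import torch

@torch.no_grad()
def autoregressive_generate(model, mel, n_samples, receptive_field):
    wav = torch.zeros(1, receptive_field)        # zero-padded history
    for t in range(n_samples):                   # one network call per sample
        context = wav[:, -receptive_field:]
        logits = model(context, mel)             # (1, 256) next-sample logits
        idx = torch.distributions.Categorical(logits=logits).sample()
        # Map the 8-bit class index back to [-1, 1] (mu-law decoding omitted).
        wav = torch.cat([wav, idx.float().view(1, 1) / 127.5 - 1.0], dim=1)
    return wav[:, receptive_field:]
```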
“…With the increasing popularity of voice assistants, virtual reality, and other artificial intelligence technologies, text-to-speech (TTS) is becoming an important component in a wide range of applications. While recent advances in neural TTS have brought significant improvements in audio quality, efficient synthesis remains challenging in many scenarios [1][2][3][4][5][6][7][8]. In practical applications, latency, computational complexity, synthesis speed, and streamability are key metrics for a production TTS system, especially when computational resources are limited, such as on mobile devices.…”
Section: Introduction (mentioning)
confidence: 99%
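
One common way to make "synthesis speed" concrete is the real-time factor (RTF): seconds of compute per second of audio produced, where RTF < 1 means faster than real time. A minimal sketch, assuming a hypothetical `synthesize(text)` that returns raw samples at a known sample rate:

```python
# Real-time factor for any TTS system; `synthesize` and the sample rate
# are placeholders for the system under test.
import time

def real_time_factor(synthesize, text, sample_rate=24000):
    start = time.perf_counter()
    wav = synthesize(text)                  # returns a 1-D array of samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(wav) / sample_rate)
```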
“…Tacotron 2 [2], FastSpeech [3], Deep Voice 3 [8], etc., usually synthesize speech in two stages: (1) generate the speech spectrum from text; and (2) generate the speech waveform by conditioning on the predicted spectrum. We focus on the problem of two-stage TTS system design and propose a spectrum model that achieves low latency, supports streaming, and produces high-quality TTS at the same time.…”
Section: Introduction (mentioning)
confidence: 99%
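
The two-stage structure described in this statement, plus the streaming property the authors target, can be sketched generically as below; `spectrum_model` and `vocoder` are placeholders for any stage-1 and stage-2 models, and the chunked loop is an illustrative take on streamability, not the authors' design.

```python
# Minimal sketch of a two-stage TTS pipeline and a streamed variant.
def tts_two_stage(text, spectrum_model, vocoder):
    mel = spectrum_model(text)   # stage 1: text -> mel-spectrogram frames
    return vocoder(mel)          # stage 2: mel-spectrogram -> waveform

def tts_streaming(text, spectrum_model, vocoder, chunk=32):
    """Emit audio chunk-by-chunk so playback can start before synthesis ends."""
    mel = spectrum_model(text)
    for i in range(0, mel.shape[-1], chunk):
        # Vocoding a window of frames at a time keeps time-to-first-audio
        # low; a real system would also need overlap handling at the seams.
        yield vocoder(mel[..., i:i + chunk])
```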
“…A development in TTS technology that is relevant to our work has been the introduction of attention-based sequence-to-sequence (AS2S) architectures [12] such as Tacotron [13,14] and Deep Voice [15], which predict spectrograms that are subsequently used to synthesise a waveform with a vocoder. For the purposes of this paper, the most salient feature of AS2S architectures is that their only conditioning input is text (or a corresponding phoneme list), not any additional model or piece of information.…”
Section: Introduction (mentioning)
confidence: 99%
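
The abstract above also mentions identifying and mitigating common attention error modes (e.g. repeated or skipped words). One mitigation the Deep Voice 3 paper describes is constraining attention to be roughly monotonic at inference time; the sketch below masks each decoder step's attention to a small window ahead of the previously attended encoder position. The window size and masking details here are illustrative assumptions.

```python
# Masking one decoder step's attention so the alignment cannot jump
# backwards or skip far ahead; details are illustrative, not the paper's.
import torch
import torch.nn.functional as F

def monotonic_attention_step(scores, prev_pos, window=3):
    """scores: (batch, T_enc) raw scores; prev_pos: (batch,) last positions."""
    positions = torch.arange(scores.size(-1), device=scores.device)
    lo = prev_pos.unsqueeze(1)                       # (batch, 1)
    mask = (positions < lo) | (positions >= lo + window)
    weights = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
    return weights, weights.argmax(dim=-1)           # new attended positions
```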