Introduction

Text-to-speech synthesis (TTS) is an important technique in multilingual spoken language communication. Due to its flexibility and small footprint, statistical parametric speech synthesis (SPSS) has become the mainstream approach in TTS. This paper investigates improving the quality of vocoded speech, which remains a problem in SPSS.

The vocoder in SPSS is the module that converts acoustic features, estimated from linguistic information by acoustic models, into speech waveforms. Vocoders ranging from a simple mel-log spectrum approximation (MLSA) filter with a simple pulse excitation and mel-cepstrum [1] to high-quality ones such as STRAIGHT [2] and WORLD [3] have been investigated. However, these high-quality vocoders are designed to analyze high-quality speech and to resynthesize it, from the full set of acoustic parameters, at the same quality as the original; they are not designed for TTS. To apply these high-quality vocoders to SPSS, the number of parameters must be reduced due to modeling constraints [4]. This reduction deteriorates synthesis quality even if the acoustic model estimates the acoustic parameters perfectly. In other words, speech quality in TTS reaches a ceiling set by the vocoder performance. Herein, a method to raise this upper limit is investigated.

In SPSS, an acoustic model is trained from speech corpora and the maximum likelihood model parameters are estimated. Recently, deep neural networks (DNNs) have been introduced for acoustic model training in SPSS; DNNs improve synthesis accuracy compared to the conventional hidden Markov model (HMM) [5,6]. Additionally, corpus-dependent high-quality vocoders with DNNs have been investigated [7,8], whereas the conventional high-quality vocoders described above [2,3] are corpus-independent.
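To make the source-filter idea behind such vocoders concrete, the following is a minimal sketch: a pulse-train excitation at a fixed F0 is shaped by a spectral envelope to produce a waveform. This is only an illustration of the principle; the actual MLSA filter operates on mel-cepstral coefficients in the time domain, whereas here a flat frequency-domain envelope stands in for it, and all function names and parameter values below are hypothetical, not taken from the paper.

```python
import numpy as np

def pulse_excitation(f0, sr, n_samples):
    """Generate a pulse-train excitation for a constant F0 (Hz)."""
    exc = np.zeros(n_samples)
    period = int(sr / f0)      # samples per glottal period
    exc[::period] = 1.0        # unit pulse at the start of each period
    return exc

def synthesize(exc, envelope):
    """Shape the excitation spectrum with a spectral envelope (stand-in
    for the MLSA/mel-cepstral filter) and return the waveform."""
    spec = np.fft.rfft(exc)
    return np.fft.irfft(spec * envelope, n=len(exc))

sr, f0, n = 16000, 100, 1024       # illustrative sample rate, pitch, length
exc = pulse_excitation(f0, sr, n)
env = np.ones(n // 2 + 1)          # flat envelope: output equals excitation
y = synthesize(exc, env)
```

In a real vocoder the envelope varies frame by frame (derived from mel-cepstrum or a STRAIGHT/WORLD spectrum), and unvoiced frames use noise rather than pulses; the quality limits discussed above stem from how coarsely these components must be parameterized for SPSS.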
Although corpus-dependent high-quality vocoders with DNNs improve speech quality compared to the conventional STRAIGHT vocoder in both HMM- and DNN-based speech synthesis [7], their synthesis quality depends greatly on the estimation accuracy of the glottal closure instants [9]. Neural network-based vocoders such as WaveNet and SampleRNN [8] require a large amount of speech data for high-quality synthesis.