An investigation of multi-speaker training for wavenet vocoder

Hayashi, Tomoki; Tamamori, Akira; Kobayashi, Kenzo; Takeda, Kazuya; Toda, Tomoki

doi:10.1109/asru.2017.8269007

Cited by 106 publications

(114 citation statements)

References 17 publications

Supporting

Mentioning

112

Contrasting

Order By: Relevance

“…The NU VC system uses a WaveNet-based vocoder [17,18,19] to model the waveform of the target speaker and generate the converted waveform using estimated speech features. Several flows are used in producing the estimated spectral features, where the direct waveform modification [2] method is employed.…”

Section: Waveform-processing Modulementioning

confidence: 99%

“…On the other hand, in the handling of prosodic parameters, such as fundamental frequency (F0), several methods have been commonly used including a simple mean/variance linear transformation, a contour-based transformation [13], GMM-based mapping [14], and neural network [15]. For waveform generation, approaches include the source-filter vocoder system [16], the latest direct waveform modification technique [2], and the use of state-ofthe-art WaveNet modeling [17,18,19].…”

Section: Introductionmentioning

confidence: 99%

“…In the waveform-processing module, the NU VC system deploys the state-of-the-art WaveNet-based vocoder [17,18,19] framework to directly model the waveform. In WaveNet [21], each waveform sample is conditioned using previous samples and possible auxiliary features within a stack of dilated convolutional layers.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

NU Voice Conversion System for the Voice Conversion Challenge 2018

Tobing¹,

Wu²,

Hayashi³

et al. 2018

EasyChair Preprints

Self Cite

View full text Add to dashboard Cite

This paper presents the NU (Nagoya University) voice conversion (VC) system for the HUB task of the Voice Conversion Challenge 2018 (VCC 2018). The design of the NU VC system can basically be separated into two modules consisting of a speech parameter conversion module and a waveformprocessing module. In the speech parameter conversion module, a deep learning framework is deployed to estimate the spectral parameters of a target speaker given those of a source speaker. Specifically, a deep neural network (DNN) and a deep mixture density network (DMDN) are used as the deep model structure. In the waveform-processing module, given the estimated spectral parameters and linearly transformed F0 parameters, the converted waveform is generated using a WaveNet-based vocoder system. To use the WaveNet-based vocoder, there are several generation flows based on an analysissynthesis framework to obtain the speech parameter set, on the basis of which a system selection process is performed to select the best one in an utterance-wise manner. The results of VCC 2018 ranked the NU VC system in second place with an overall mean opinion score (MOS) of 3.44 for speech quality and 85% accuracy for speaker similarity.

show abstract

Section: Waveform-processing Modulementioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

NU Voice Conversion System for the Voice Conversion Challenge 2018

Tobing¹,

Wu²,

Hayashi³

et al. 2018

EasyChair Preprints

Self Cite

View full text Add to dashboard Cite

show abstract

“…To address this issue, many neural-based vocoders [18][19][20][21][22][23] have been proposed to replace the traditional vocoders in the synthesis part of VC. In this paper, we focus on the WaveNet (WN) vocoder [18][19][20][21], which is an autoregressive model conditioned on auxiliary features to generate a raw waveform without many handcrafted assumptions. Although the WN vocoder generate high-fidelity speech conditioned on the training acoustic features, the fixed network architectures of WN are not efficient and may reduce the robustness against unseen fundamental frequency (F0) features that are not observed in the range of training data.…”

Section: Introductionmentioning

confidence: 99%

Statistical Voice Conversion with Quasi-periodic WaveNet Vocoder

Wu¹,

Tobing²,

Hayashi³

et al. 2019

10th ISCA Workshop on Speech Synthesis (SSW 10)

Self Cite

View full text Add to dashboard Cite

In this paper, we investigate the effectiveness of a quasi-periodic WaveNet (QPNet) vocoder combined with a statistical spectral conversion technique for a voice conversion task. The WaveNet (WN) vocoder has been applied as the waveform generation module in many different voice conversion frameworks and achieves significant improvement over conventional vocoders. However, because of the fixed dilated convolution and generic network architecture, the WN vocoder lacks robustness against unseen input features and often requires a huge network size to achieve acceptable speech quality. Such limitations usually lead to performance degradation in the voice conversion task. To overcome this problem, the QPNet vocoder is applied, which includes a pitch-dependent dilated convolution component to enhance the pitch controllability and attain a more compact network than the WN vocoder. In the proposed method, input spectral features are first converted using a framewise deep neural network, and then the QPNet vocoder generates converted speech conditioned on the linearly converted prosodic and transformed spectral features. The experimental results confirm that the QPNet vocoder achieves significantly better performance than the same-size WN vocoder while maintaining comparable speech quality to the double-size WN vocoder.

show abstract

“…In recent years, there are two mainstreams that try to improve the waveform generation module. One direction is to develop neural vocoders [11,12,13,14,15,16,17,18,19], which are capable of reconstructing the phase and excitation information, and thus generate extremely natural sounding speech.…”

Section: Introductionmentioning

confidence: 99%

Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion

Huang¹,

Wu²,

Kobayashi³

et al. 2019

10th ISCA Workshop on Speech Synthesis (SSW 10)

Self Cite

View full text Add to dashboard Cite

We present a modification to the spectrum differential based direct waveform modification for voice conversion (DIFFVC) so that it can be directly applied as a waveform generation module to voice conversion models. The recently proposed DIFFVC avoids the use of a vocoder, meanwhile preserves rich spectral details hence capable of generating high quality converted voice. To apply the DIFFVC framework, a model that can estimate the spectral differential from the F0 transformed input speech needs to be trained beforehand. This requirement imposes several constraints, including a limitation on the estimation model to parallel training and the need of extra training on each conversion pair, which make DIFFVC inflexible. Based on the above motivations, we propose a new DIFFVC framework based on an F0 transformation in the residual domain. By performing inverse filtering on the input signal followed by synthesis filtering on the F0 transformed residual signal using the converted spectral features directly, the spectral conversion model does not need to be retrained or capable of predicting the spectral differential. We describe several details that need to be taken care of under this modification, and by applying our proposed method to a non-parallel, variational autoencoder (VAE)-based spectral conversion model, we demonstrate that this framework can be generalized to any spectral conversion model, and experimental evaluations show that it can outperform a baseline framework whose waveform generation process is carried out by a vocoder.

show abstract

An investigation of multi-speaker training for wavenet vocoder

Cited by 106 publications

References 17 publications

NU Voice Conversion System for the Voice Conversion Challenge 2018

NU Voice Conversion System for the Voice Conversion Challenge 2018

Statistical Voice Conversion with Quasi-periodic WaveNet Vocoder

Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion

Contact Info

Product

Resources

About