Interspeech 2017
DOI: 10.21437/interspeech.2017-986

Statistical Voice Conversion with WaveNet-Based Waveform Generation

Abstract: This paper presents a statistical voice conversion (VC) technique with the WaveNet-based waveform generation. VC based on a Gaussian mixture model (GMM) makes it possible to convert the speaker identity of a source speaker into that of a target speaker. However, in the conventional vocoding process, various factors such as F0 extraction errors, parameterization errors and over-smoothing effects of converted feature trajectory cause the modeling errors of the speech waveform, which usually bring about sound qua…

Cited by 78 publications (73 citation statements)
References 23 publications (22 reference statements)

“…However, human perception is quite sensitive to speech quality, and that of synthesized speech highly depends on the generation model. WaveNet (WN) [4] is one of the state-of-the-art speech generation models, which has been applied to many applications, such as speech enhancement [12,13], text-to-speech (TTS) [7,9], speech coding [11], and voice conversion (VC) [15][16][17][18]. Specifically, WN is an autoregressive model that predicts a current speech sample based on a specific number of previous samples which is called the receptive field.…”
Section: Introduction (mentioning)
confidence: 99%
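
The receptive field mentioned in the statement above can be made concrete with a small calculation. The following is a minimal sketch, assuming a WaveNet-style stack of dilated causal convolutions with kernel size 2 and ignoring any initial causal convolution layer. The function name `receptive_field` and the layer configuration (3 stacks of 10 layers, dilations 1 to 512) are illustrative assumptions and are not taken from the cited papers.

```python
def receptive_field(dilations, kernel_size=2):
    """Count the input samples that can influence one output sample.

    Each dilated causal convolution with dilation d widens the visible
    context by (kernel_size - 1) * d samples.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical configuration (an assumption, not from the source):
# 3 stacks of 10 layers with dilations 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)] * 3

print(receptive_field(dilations))  # 3070
```

At a 16 kHz sampling rate, which is also an assumption here, 3070 samples would cover roughly 0.19 seconds of audio.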
“…WaveNet [18] as one of the state-of-the-art audio generation models has been widely applied to various VC systems that take WN as a vocoder to generate converted waveforms from the converted acoustic features. For example, Kobayashi et al [28] combined GMM-based Mel-cepstral coefficient (mcep) conversion and linear transformation of prosodic features with the WN vocoder. Furthermore, in our previous works, we explored the effectiveness of different mcep conversion models with the WN vocoder, including a DNN [25,29], deep mixture density network (DMDN) [26], VAE [30], long short-term memory (LSTM) [31], and gated recurrent unit (GRU) [32].…”
Section: Related Work (mentioning)
confidence: 99%
“…The NU VC system uses a WaveNet-based vocoder [17,18,19] to model the waveform of the target speaker and generate the converted waveform using estimated speech features. Several flows are used in producing the estimated spectral features, where the direct waveform modification [2] method is employed.…”
Section: Waveform-processing Module (mentioning)
confidence: 99%
“…On the other hand, in the handling of prosodic parameters, such as fundamental frequency (F0), several methods have been commonly used including a simple mean/variance linear transformation, a contour-based transformation [13], GMM-based mapping [14], and neural network [15]. For waveform generation, approaches include the source-filter vocoder system [16], the latest direct waveform modification technique [2], and the use of state-of-the-art WaveNet modeling [17,18,19].…”
Section: Introduction (mentioning)
confidence: 99%
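
The "simple mean/variance linear transformation" of F0 named in the statement above is commonly applied to log-F0 of voiced frames. The sketch below assumes that convention; the quoted statement does not specify the domain. The function name `convert_f0` and the convention that unvoiced frames are marked with 0 are assumptions made for illustration, and the statistics are assumed to be precomputed per speaker.

```python
import numpy as np


def convert_f0(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Map source F0 (Hz) to the target speaker's log-F0 statistics.

    The means and standard deviations are log-F0 statistics (assumed to be
    estimated over voiced frames). Frames with F0 == 0 are treated as
    unvoiced and left at 0.
    """
    f0_src = np.asarray(f0_src, dtype=float)
    voiced = f0_src > 0
    f0_conv = np.zeros_like(f0_src)
    lf0 = np.log(f0_src[voiced])
    # Normalize with source statistics, then rescale with target statistics.
    f0_conv[voiced] = np.exp((lf0 - src_mean) / src_std * tgt_std + tgt_mean)
    return f0_conv


# Toy example: shift a contour up by one octave while keeping its spread.
f0 = [0.0, 100.0, 200.0]
converted = convert_f0(f0, np.log(100.0), 1.0, np.log(200.0), 1.0)
```

In the toy call, equal standard deviations and a mean shift of log 2 double every voiced value, so the contour moves up one octave and the unvoiced frame stays at 0.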