Interspeech 2019
DOI: 10.21437/interspeech.2019-1774

Investigation of F0 Conditioning and Fully Convolutional Networks in Variational Autoencoder Based Voice Conversion

Abstract: In this work, we investigate the effectiveness of two techniques for improving variational autoencoder (VAE) based voice conversion (VC). First, we reconsider the relationship between vocoder features extracted using the high-quality vocoders adopted in conventional VC systems, and hypothesize that the spectral features are in fact F0 dependent. This hypothesis implies that, during the conversion phase, the latent codes and the converted features in VAE-based VC depend on the source F0. To this end, we …

Cited by 14 publications (12 citation statements)
References 28 publications
“…Following [68], the latent space and speaker representation were set to 16-dimensional. We used a mini-batch of 16 and the Adam optimizer with a fixed learning rate of 0.0001.…”
Section: Experimental Evaluations, A. Experimental Settings
Citation type: mentioning
confidence: 99%
“…In this section, we investigate the degree of disentanglement of the VC models involved in this study. We use a novel metric that was recently proposed in [68] as the disentanglement measurement, termed DEM. The main design concept of DEM is that a pair of sentences of the same content uttered by the source and target speakers should have similar latent codes since the phonetic contents are the same.…”
Section: F. Disentanglement Measure
Citation type: mentioning
confidence: 99%
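
The design concept quoted above (parallel utterances from two speakers should yield similar latent codes) can be illustrated with a minimal sketch. This is not the DEM definition from [68], which the excerpt does not give. It assumes, for illustration only, that the two latent-code sequences are already time-aligned and of equal length, and that similarity is measured as mean frame-wise cosine similarity. The function names `cosine_similarity` and `latent_similarity` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Similarity is undefined for a zero vector; treat it as no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


def latent_similarity(source_codes, target_codes):
    """Mean frame-wise cosine similarity between two latent-code sequences.

    Assumption (not from the source): both sequences are already
    time-aligned, so frame i of the source matches frame i of the target.
    """
    if not source_codes or len(source_codes) != len(target_codes):
        raise ValueError("sequences must be non-empty and of equal length")
    sims = [cosine_similarity(s, t) for s, t in zip(source_codes, target_codes)]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-dimensional latent codes for two frames of a parallel utterance.
    src = [[1.0, 0.0], [0.5, 0.5]]
    tgt = [[0.9, 0.1], [0.4, 0.6]]
    print(latent_similarity(src, tgt))
```

Under this sketch, a score closer to 1 means the encoder maps the same content from different speakers to more similar codes, which is the behaviour the quoted passage associates with better disentanglement.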
“…We believe that a means to address this problem is to separate emotional features from emotion-independent features. Autoencoder (AE)- or variational autoencoder (VAE)-based feature disentanglement methods provide a possible solution and have been successfully applied to the voice conversion (VC) task [15,16,17,18,19,20]. In [17], by providing the speaker identity features as a condition to the decoder, the encoder learns to encode only the speaker-independent information in the process of minimising the reconstruction loss.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Autoencoder (AE)- or variational autoencoder (VAE)-based feature disentanglement methods provide a possible solution and have been successfully applied to the voice conversion (VC) task [15,16,17,18,19,20]. In [17], by providing the speaker identity features as a condition to the decoder, the encoder learns to encode only the speaker-independent information in the process of minimising the reconstruction loss. Experimental results from [20] show that the degree of disentanglement is positively correlated with the performance of the VC model and can be enhanced by both GANs and the speaker classifier.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Subjective evaluation results of the FCN-CDVAE based VC method [37] with a waveform generation process by the WORLD vocoder or the proposed method. Here M and F denote male and female, respectively.…”
Citation type: mentioning
confidence: 99%