2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018
DOI: 10.1109/icassp.2018.8462342

High-Quality Nonparallel Voice Conversion Based on Cycle-Consistent Adversarial Network

Abstract: Although voice conversion (VC) algorithms have achieved remarkable success along with the development of machine learning, superior performance is still difficult to achieve when using nonparallel data. In this paper, we propose using a cycle-consistent adversarial network (CycleGAN) for nonparallel data-based VC training. A CycleGAN is a generative adversarial network (GAN) originally developed for unpaired image-to-image translation. A subjective evaluation of inter-gender conversion demonstrated that the pr…

Cited by 108 publications (74 citation statements)
References: 27 publications
“…Indeed, the challenge in developing the non-parallel spectral conversion model has attracted many works within the recent years, such as: with the use of clustered spectral matching algorithms [12,13]; with adaptation/alignment of speaker model parameters [14,15]; with restricted Boltzmann machine [16]; with generative adversarial networks (GAN)-based methods [17,18]; and with variational autoencoder (VAE)-based frameworks [19,20,21,22]. In this work, we focus on the use of VAE-based system, due to its potential in employing latent space to represent common hidden aspects of speech signal, between different speakers, e.g., phonetical attributes.…”
Section: Introduction (mentioning)
confidence: 99%
“…As the sample rate of the codebook embeddings of our system was 320 times smaller than the speech samples, the WaveNet couldn't produce a satisfying result. GANs are known to be effective for achieving high-quality voice conversion with clean input data [29,30]. However, our task is more challenging due to the fact that our generated voice will always have some distortion.…”
Section: Results (mentioning)
confidence: 99%
“…5 is the full model). We also compared CycleGAN-VC2 with two state-of-the-art methods: CycleGAN-VC [29] and frame-based CycleGAN [30] (our reimplementation; we additionally used L_id for stabilizing training). The comparison of one-step and two-step adversarial losses (nos.…”
Section: Objective Evaluation (mentioning)
confidence: 99%