2017
DOI: 10.48550/arxiv.1704.00849
Preprint

Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks

Cited by 89 publications
(60 citation statements)
References 15 publications
“…VC performance is critically dependent on the availability of the target speaker's voice data for training [11][12][13][14][15][16][17]. Hence, the challenge of one-shot VC is in performing conversion across arbitrary speakers that may be unseen during training, and with only one single target-speaker utterance for reference.…”
Section: Related Work (mentioning)
confidence: 99%
“…Auto-encoder [116] is one of the techniques which is commonly used for speech disentanglement and reconstruction. Studies have shown the effectiveness of autoencoder [24] and its variants [25,128,129] in disentangling the speaker information from the content, thus they are widely used in speaker voice conversion [130] and singing voice conversion [131].…”
Section: Disentanglement Between Emotional Prosody and Linguistic Con... (mentioning)
confidence: 99%
“…In this way, we learn the frame-level content-related representations from the speech in an unsupervised manner [50,60,48]. VAE-GAN [50] and VAW-GAN [25,60] are successful attempts. Another idea is for an autoencoder to learn an emotion-invariant latent code and an emotion-related style code in latent space [49].…”
Section: Disentanglement Between Emotional Prosody and Linguistic Con... (mentioning)
confidence: 99%
“…Voice conversion has taken some major strides in terms of speech quality and speaker similarity. Various approaches have been proposed, such as Gaussian mixture model (GMM) [3,4,5], frequency warping approaches [6,7,8], exemplar based methods [9,10,11], and neural network based methods [12,13,14,15,16,17,18,19,20,21]. Recently, disentangling speaker and linguistic content representations based on deep learning for voice conversion [22,23,24,25] has received much attention.…”
Section: Introduction (mentioning)
confidence: 99%