Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
Preprint, 2018
DOI: 10.48550/arxiv.1804.02812

Cited by 20 publications (41 citation statements)
References 20 publications
“…Similarly, Hung et al. [19] explored disentangled representations for the timbre and pitch of musical sounds, useful for music editing. Chou et al. [20] explored the disentanglement of speaker characteristics from linguistic content in speech signals for voice conversion. Chen et al. [21] explored the idea of disentangling phonetic and speaker information for the task of audio representation.…”
Section: Representation Learning
confidence: 99%
“…Vector-quantization-based methods [14] have further been proposed to model content information as discrete distributions, which are more closely related to the distribution of phonetic information. An auxiliary adversarial speaker classifier is adopted in [15] to encourage the encoder to cast away speaker information from content information by minimizing the mutual information between their representations [16].…”
Section: Introduction
confidence: 99%
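The vector-quantization step described in the statement above replaces each frame-level content embedding with its nearest codebook entry, discretizing the content representation. A minimal sketch follows; the array shapes and names are illustrative assumptions, not code from the cited work:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each content frame to its nearest codebook vector.

    z:        (T, D) continuous frame-level content embeddings
    codebook: (K, D) learned discrete code vectors
    Returns the quantized frames (T, D) and the chosen indices (T,).
    """
    # Squared Euclidean distance from every frame to every code: (T, K)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)          # nearest code per frame
    return codebook[idx], idx
```

In a full model the codebook is trained jointly with the encoder (e.g. via a commitment loss and straight-through gradients), but the lookup itself is just this nearest-neighbor assignment.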
“…While disentangling natural image representation has been studied extensively, much less work has focused on natural speech, leaving a rather large void in the understanding of this problem. In this paper, we first present a short review and comparison of two representative efforts on this topic [6,7], where both efforts involve using an auto-encoder and can be applied to the same task (i.e., voice conversion), but the key disentangling algorithms and underlying ideas are very different.…”
Section: Introduction
confidence: 99%
“…Different from [6], in [7] the authors propose a supervised approach based on adversarial training [11, 12, 13, 14] (illustrated in Figure 2, left). In addition to a regular autoencoder, the authors add a regularization term to its objective function to force the latent variables (i.e., the encoding) to not contain speaker information.…”
Section: Introduction
confidence: 99%
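The adversarial regularization described above can be sketched as a combined objective: the encoder minimizes reconstruction error while *maximizing* the loss of an auxiliary speaker classifier that reads the latent code, so speaker information is pushed out of the encoding. The function below is a hypothetical illustration of that objective under assumed names and shapes, not the cited paper's implementation:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def encoder_objective(x, x_hat, spk_logits, spk_id, lam=1.0):
    """Adversarial autoencoder objective (encoder's view).

    x, x_hat:   original and reconstructed feature vectors
    spk_logits: speaker classifier's logits computed from the latent code
    spk_id:     true speaker index
    lam:        weight of the adversarial regularization term
    """
    recon = np.mean((x - x_hat) ** 2)                 # reconstruction loss
    ce = -np.log(softmax(spk_logits)[spk_id])         # classifier cross-entropy
    # Subtracting the classifier loss means the encoder is rewarded when
    # the speaker classifier fails, i.e. when the latent hides speaker identity.
    return recon - lam * ce
```

In practice the classifier and encoder are updated alternately (or via a gradient-reversal layer), with the classifier separately trained to minimize the same cross-entropy.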