ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053854

One-Shot Voice Conversion by Vector Quantization

Abstract: Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tone in an audio signal into another speaker's while preserving the linguistic content. It remains a challenging task, especially in a one-shot setting. Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity, so these methods can further generalize to unseen speakers. The disentanglement capability is achieved by vector quantization (VQ), adversarial training, or instance…
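The vector-quantization bottleneck mentioned in the abstract can be illustrated with a minimal sketch: each frame of a continuous feature sequence is snapped to its nearest codebook entry, discarding fine-grained (e.g., speaker-specific) detail. The codebook size and feature dimension here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Map each feature frame to its nearest codebook entry (L2 distance).

    frames:   (T, D) continuous encoder outputs
    codebook: (K, D) learned code vectors (assumed given here)
    returns:  quantized frames (T, D) and code indices (T,)
    """
    # Squared L2 distance between every frame and every code: (T, K)
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx
```

In a trained VQ auto-encoder the codebook is learned jointly with the encoder; this sketch only shows the forward quantization step that forms the information bottleneck.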

Cited by 53 publications (24 citation statements) · References 27 publications
“…Furthermore, we focused on a real-time VC system that can be applied to relatively limited use cases (e.g., parallel data and one-to-one speaker mapping) in this work. Our additional task would be extending our system to other VC frameworks, including non-parallel training [39], [40] and multi-speaker conversion with speaker adaptation techniques [41].…”
Section: Discussion
confidence: 99%
“…Unsupervised approaches to speaker-content disentanglement are proposed in [14,15,16,17]. None of these works use an explicit disentanglement objective as proposed in this paper.…”
Section: Related Work
confidence: 99%
“…In [18], the AutoVC model was extended to an unsupervised disentanglement of timbre, pitch, rhythm and content. The VQVC [16] achieves disentanglement by applying a bottleneck in the form of vector quantization (VQ). The factorized hierarchical variational autoencoder (FHVAE) proposed in [14] disentangles "sequence-level" (>200 ms) and "segment-level" (<200 ms) attributes in an unsupervised manner, by restricting sequence-level embeddings to be rather static within an utterance while also putting a bottleneck on the segment-level embeddings.…”
Section: Related Work
confidence: 99%
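The VQVC-style disentanglement described above can be sketched as follows: the quantized codes serve as the content representation, while the time-averaged residual between the continuous encoding and its quantized version serves as a speaker representation. This is a simplified illustration under assumed shapes, not the authors' exact recipe.

```python
import numpy as np

def disentangle(enc, codebook):
    """Split an encoding into content and speaker parts, VQVC-style (sketch).

    enc:      (T, D) continuous encoder outputs
    codebook: (K, D) code vectors
    returns:  content codes (T, D) and a speaker vector (D,)
    """
    # Quantize each frame to its nearest code: content-like discrete path
    d = ((enc[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    content = codebook[d.argmin(axis=1)]
    # Average the quantization residual over time: speaker-like global path
    speaker = (enc - content).mean(axis=0)
    return content, speaker
```

Because the residual is averaged over the whole utterance, frame-varying (content) information cancels out while a global, speaker-dependent offset survives, which is the intuition behind using VQ as a disentanglement bottleneck.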
“…Furthermore, zero-shot voice conversion, which focuses on conversion between speakers who are unseen in the training dataset, became a new research direction. Previous methods [21,22,23] have achieved zero-shot VC by disentangling the speaker identity and the speech content. Speaker embeddings are used to represent the identity of source and target speakers.…”
Section: Introduction
confidence: 99%