ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053854

One-Shot Voice Conversion by Vector Quantization

Abstract: Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tone in an audio signal into another speaker's while preserving the linguistic content. It remains a challenging task, especially in a one-shot setting. Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity, so these methods can further generalize to unseen speakers. The disentanglement capability is achieved by vector quantization (VQ), adversarial training, or instance…
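The vector-quantization bottleneck mentioned in the abstract can be illustrated with a minimal sketch: each frame of a continuous feature sequence is snapped to its nearest codebook entry, discarding fine-grained (e.g., speaker-specific) detail. The codebook size and feature dimension here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Map each feature frame to its nearest codebook entry (L2 distance).

    frames:   (T, D) continuous encoder outputs
    codebook: (K, D) learned code vectors (assumed given here)
    returns:  quantized frames (T, D) and code indices (T,)
    """
    # Squared L2 distance between every frame and every code: (T, K)
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx
```

In a trained VQ auto-encoder the codebook is learned jointly with the encoder; this sketch only shows the forward quantization step that forms the information bottleneck.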

Cited by 53 publications (24 citation statements) · References 27 publications
“…Furthermore, we focused on a real-time VC system that can be applied to relatively limited use cases (e.g., parallel data and one-to-one speaker mapping) in this work. Our additional task would be extending our system to other VC frameworks, including non-parallel training [39], [40] and multi-speaker conversion with speaker adaptation techniques [41].…”
Section: Discussion
confidence: 99%
“…Unsupervised approaches to speaker-content disentanglement are proposed in [14,15,16,17]. None of these works use an explicit disentanglement objective as proposed in this paper.…”
Section: Related Work
confidence: 99%
“…In [18], the AutoVC model was extended to an unsupervised disentanglement of timbre, pitch, rhythm and content. The VQVC [16] achieves disentanglement by applying a bottleneck in the form of vector quantization (VQ). The factorized hierarchical variational autoencoder (FHVAE) proposed in [14] disentangles "sequence-level" (>200 ms) and "segment-level" (<200 ms) attributes in an unsupervised manner, by restricting sequence-level embeddings to be rather static within an utterance while also putting a bottleneck on the segment-level embeddings.…”
Section: Related Work
confidence: 99%
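The VQVC-style disentanglement described above can be sketched as follows: the quantized codes serve as the content representation, while the time-averaged residual between the continuous encoding and its quantized version serves as a speaker representation. This is a simplified illustration under assumed shapes, not the authors' exact recipe.

```python
import numpy as np

def disentangle(enc, codebook):
    """Split an encoding into content and speaker parts, VQVC-style (sketch).

    enc:      (T, D) continuous encoder outputs
    codebook: (K, D) code vectors
    returns:  content codes (T, D) and a speaker vector (D,)
    """
    # Quantize each frame to its nearest code: content-like discrete path
    d = ((enc[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    content = codebook[d.argmin(axis=1)]
    # Average the quantization residual over time: speaker-like global path
    speaker = (enc - content).mean(axis=0)
    return content, speaker
```

Because the residual is averaged over the whole utterance, frame-varying (content) information cancels out while a global, speaker-dependent offset survives, which is the intuition behind using VQ as a disentanglement bottleneck.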
“…Furthermore, zero-shot voice conversion, which focuses on conversion between speakers who are unseen in the training dataset, became a new research direction. Previous methods [21,22,23] have achieved zero-shot VC by disentangling the speaker identity and the speech content. Speaker embeddings are used to represent the identity of source and target speakers.…”
Section: Introduction
confidence: 99%