DeepConversion: Voice conversion with limited parallel training data

Zhang, Mingyang; Şişman, Berrak; Zhao, Li; Li, Haizhou

doi:10.1016/j.specom.2020.05.004

Cited by 20 publications

(5 citation statements)

References 30 publications

“…Mel-cepstral distortion (MCD) [246] is commonly used to measure the difference between two spectral features [62], [67], [256], [257]. It is calculated between the converted and target Mel-cepstral coefficients, or MCEPs, [258], [259], ŷ and y.…”

Section: A Objective Evaluation 1) Spectrum Conversion (mentioning)

confidence: 99%

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

Şişman, Yamagishi, King, et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

208
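
For readers unfamiliar with the metric named in the citation statement above, the following is a minimal sketch of how MCD is commonly computed. It is not the cited authors' implementation. It assumes the widely used form MCD[dB] = (10 / ln 10) · sqrt(2 · Σ_d (y_d − ŷ_d)²), averaged over frames; that the converted and target MCEP sequences are already time-aligned (for example by dynamic time warping); and that index 0 of each frame holds the energy coefficient and is skipped. The function name and the `skip_c0` parameter are illustrative choices.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    converted: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    skip_c0: bool = True,
) -> float:
    """Average frame-wise MCD in dB between two time-aligned MCEP sequences."""
    if not converted or len(converted) != len(target):
        raise ValueError("sequences must be non-empty and have the same number of frames")

    # Assumption: coefficient 0 is the energy term, which is usually excluded.
    start = 1 if skip_c0 else 0
    scale = 10.0 / math.log(10.0)

    total = 0.0
    for y_hat, y in zip(converted, target):
        if len(y_hat) != len(y):
            raise ValueError("each pair of frames must have the same number of coefficients")
        # Squared Euclidean distance between the converted and target frame.
        squared_error = sum((a - b) ** 2 for a, b in zip(y_hat[start:], y[start:]))
        total += scale * math.sqrt(2.0 * squared_error)

    # Mean over all frames.
    return total / len(converted)


if __name__ == "__main__":
    # Toy data: two frames of three coefficients each (values are made up).
    converted_mcep = [[0.0, 1.0, 0.5], [0.0, 0.9, 0.4]]
    target_mcep = [[0.0, 1.1, 0.5], [0.0, 1.0, 0.6]]
    print(f"MCD: {mel_cepstral_distortion(converted_mcep, target_mcep):.3f} dB")
```

A lower value means the converted spectrum is closer to the target; identical sequences give 0 dB.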

“…Text-based approaches that use ASR models have accurate linguistic information, are unlikely to be corrupted during voice conversion, and can even perform voice conversion between speakers of different languages if the ASR model used supports multiple languages. For example, DeepConversion utilizes an ASR model to perform voice conversion by mapping PPGs, speaker-dependent features, and Mel-Cepstral coefficients (MCEP) [25]. However, because a large amount of parallel data is required to train an ASR model to extract PPGs used in voice conversion, there may be inevitable errors in the process of extracting PPGs (owing to insufficient data) for training in a low-resource, multilingual environment.…”

Section: Introduction (mentioning)

confidence: 99%

Perturbation AUTOVC: Voice Conversion From Perturbation and Autoencoder Loss

Park, Lee, Chun 2023

IEEE Access

AUTOVC is a voice-conversion method that performs self-reconstruction using an autoencoder structure for zero-shot voice conversion. AUTOVC has the advantage of being easy and simple to learn because it only uses the autoencoder loss for learning. However, it performs voice conversion by disentangling speech information from speakers and linguistic information by adjusting the bottleneck dimension; this requires highly meticulous fine tuning of the bottleneck dimension and involves a tradeoff between speech quality and speaker similarity. To address these issues, neural analysis and synthesis (NANSY)-a fully self-supervised learning system that uses perturbations to extract speech features-is proposed. NANSY solves the problem of the adjustment of the bottleneck dimension by utilizing perturbation and exhibits high-reconstruction performance. In this study, we propose perturbation AUTOVC, a voice conversion method that utilizes the structure of AUTOVC and the perturbation of NANSY. The proposed method applies perturbations to speech signals (such as NANSY signals) to solve the problem of the voice conversion method using bottleneck dimensions. Perturbation is applied to remove the speaker-dependent information present in the speech, leaving only the linguistic information, which is then passed through a content encoder and modeled as a content embedding containing only the linguistic information. To obtain speaker information, we used x-vectors, which are extensively used in pretrained speaker recognition. The concatenated linguistic and speaker information extracted from the encoder and additional energy information is used as input to the decoder to perform self-reconstruction. Similar to AUTOVC, it is easy and simple to learn using only the autoencoder loss. 
For the evaluation, we measured three objective evaluation metrics: character error rate (%), cosine similarity, and short-time objective intelligibility, as well as a subjective evaluation metric: mean opinion score. The experimental results demonstrate that our proposed method outperforms other voice conversion techniques and demonstrated robust performance in zero-shot conversion.

“…Many state-of-the-art VC methods [23]- [25] have been proposed and implemented for parallel and non-parallel VC. It is possible to train the parallel VC in a limited dataset [26]. If the performance of VC is not precise enough, voice augmentation for VC is possible.…”

Section: Introduction (mentioning)

confidence: 99%

Voice Conversion Based Augmentation and a Hybrid CNN-LSTM Model for Improving Speaker-Independent Keyword Recognition on Limited Datasets

Wubet, Lian 2022

IEEE Access

Keyword recognition is the basis of speech recognition, and its application is rapidly increasing in keyword spotting, robotics, and smart home surveillance. Because of these advanced applications, improving the accuracy of keyword recognition is crucial. In this paper, we proposed voice conversion (VC) -based augmentation to increase the limited training dataset and a fusion of a convolutional neural network (CNN) and long-short term memory (LSTM) model for robust speaker-independent isolated keyword recognition. Collecting and preparing a sufficient amount of voice data for speaker-independent speech recognition is a tedious and bulky task. To overcome this, we generated new raw voices from the original voices using an auxiliary classifier conditional variational autoencoder (ACVAE) method. In this study, the main intention of voice conversion is to obtain numerous and various human-like keywords' voices that are not identical to the source and target speakers' pronunciation. Parallel VC was used to accurately maintain the linguistic content. We examined the performance of the proposed voice conversion augmentation techniques using robust deep neural network algorithms. Original training data, excluding generated voice using other data augmentation and regularization techniques, were considered as the baseline. The results showed that incorporating voice conversion augmentation into the baseline augmentation techniques and applying the CNN-LSTM model improved the accuracy of isolated keyword recognition.
