Voice Conversion for Dubbing Using Linear Predictive Coding and Hidden Markov Model

Mukhneri, Firra M.; Wijayanto, Inung; Hadiyoso, Sugondo

doi:10.35741/issn.0258-2724.55.4.33

Cited by 7 publications

(4 citation statements)

References 11 publications


“…Our work is different from all of these in several important aspects: (a) first, our work considered using real media data from professional voice talents, (b) next, in [18], dubbing was performed for Indonesian language using several words only, (c) in [17], VC was performed for data augmentation and did not investigate target speaker quality and similarity, (d) and finally, in [19] VC was applied to the output of a speech synthesizer.…”

Section: Introduction (mentioning)

confidence: 99%


Adapting Pretrained Models for Adult to Child Voice Conversion

Sudro, Ragni, Hain, 2023

2023 31st European Signal Processing Conference (EUSIPCO)


Due to a widespread lack of parallel data for adult-to-child voice conversion (VC), non-parallel VC techniques have grown in popularity. Methods such as encoder-decoder models have achieved good performance in adult-to-adult VC. They provide flexibility by allowing each module to be trained separately or replaced with a pretrained model. These pretrained models are only available for adult speech. In the case of children's speech, we do not have enough data to train all the modules of a robust encoder-decoder based VC system. In a limited-data scenario, we can only train the decoder module using target speech. Specifically, we find that adult-to-child VC using a pretrained encoder and a decoder trained on child speech does not yield the spectral variability of child speech. The reason is the gross spectral mismatch between adult and child speech. We address this mismatch by exploiting a warping mechanism to modify the acoustic attributes based on child speech. We conduct objective and subjective evaluations on the CMU and CSLU kids corpora and data from one adult actress. Results show that the proposed method reduces MCD and F0 RMSE by 0.67 and 0.03, respectively. For subjective evaluations, we observe a relative MOS improvement of 10.7% for naturalness and 18.23% for similarity.



“…One of the studies reported using read speech for training adult to child CycleGAN VC model for ASR application [17]. Other studies by [18], [19] reported using Gaussian mixture model (GMM) based adult to child VC for speaker adaptation and dubbing.…”

Section: Introduction (mentioning)

confidence: 99%

Adapting Pretrained Models for Adult to Child Voice Conversion

Sudro, Ragni, Hain, 2023

2023 31st European Signal Processing Conference (EUSIPCO)

“…Once having a good disentangle strategy, the model can generate a high quality of speech from the given utterance and style. A successful VC can be applied to various fields, such as personal electrical support as an audio assistant (Lu et al 2021), entertainment usage for dubbing (Mukhneri, Wijayanto, and Hadiyoso 2020), and industrial applications for voice changers, etc.…”

Section: Introduction (mentioning)

confidence: 99%

Zero-Shot Face-Based Voice Conversion: Bottleneck-Free Speech Disentanglement in the Real-World Scenario

Weng, Shuai, Cheng, 2023

AAAI


Often a face has a voice. Appearance sometimes has a strong relationship with one's voice. In this work, we study how a face can be converted to a voice, which is a face-based voice conversion. Since there is no clean dataset that contains face and speech, voice conversion faces difficult learning and low-quality problems caused by background noise or echo. Too much redundant information for face-to-voice also causes synthesis of a general style of speech. Furthermore, previous work tried to disentangle speech with bottleneck adjustment. However, it is hard to decide on the size of the bottleneck. Therefore, we propose a bottleneck-free strategy for speech disentanglement. To avoid synthesizing the general style of speech, we utilize framewise facial embedding. It applied adversarial learning with a multi-scale discriminator for the model to achieve better quality. In addition, the self-attention module is added to focus on content-related features for in-the-wild data. Quantitative experiments show that our method outperforms previous work.


“…Voice conversion (VC) aims at transforming the vocal timbre of the source speech to the target speaker while preserving its linguistic content. It has many applications, including movie dubbing [3], speaking assistance [4] and singing [5,6,7]. With the advances of deep learning, neural voice conversion methods have been studied extensively in recent years with high-quality natural converted speech [8], such as generative adversarial network (GAN)-based [9,10], variational autoencoder (VAE)-based [11], autoencoder-based [12] and flow-based [13] models, to name a few.…”

Section: Introduction (mentioning)

confidence: 99%

Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers

Xue, Yang, Hu, et al., 2022

Preprint


Building a voice conversion system for noisy target speakers, such as users providing noisy samples or Internet found data, is a challenging task since the use of contaminated speech in model training will apparently degrade the conversion performance. In this paper, we leverage the advances of our recently proposed Glow-WaveGAN [1] and propose a noise-independent speech representation learning approach for high-quality voice conversion for noisy target speakers. Specifically, we learn a latent feature space where we ensure that the target distribution modeled by the conversion model is exactly from the modeled distribution of the waveform generator. With this premise, we further manage to make the latent feature to be noise-invariant. Specifically, we introduce a noise-controllable WaveGAN, which directly learns the noise-independent acoustic representation from waveform by the encoder and conducts noise control in the hidden space through a FiLM [2] module in the decoder. As for the conversion model, importantly, we use a flow-based model to learn the distribution of noise-independent but speaker-related latent features from phoneme posteriorgrams. Experimental results demonstrate that the proposed model achieves high speech quality and speaker similarity in the voice conversion for noisy target speakers.

