Transformation of speaker characteristics for voice conversion

Rentzos, Dimitrios; Vaseghi, Saeed; Turajlić, Emir; Yan, Qing; Ho, Ching-Hsiang

doi:10.1109/asru.2003.1318526

Cited by 17 publications

(10 citation statements)

References 4 publications

“…All combinations between one female and two male speakers showed that higher formants were better converted than lower ones. Rentzos et al (2003) used the formant transformation by a two-dimensional phoneme-dependent hidden Markov models (HMM), glottal pulse LF model transformation, and pitch transformation based on time-domain pitch-synchronous overlap-and-add (TD-PSOLA) method. Poles of the LPC model were used for formant estimation.…”

Section: Overview Of Voice Conversion Methods (mentioning)

confidence: 99%

Non-linear frequency scale mapping for voice conversion in text-to-speech system with cepstral description

Přibilová, Přibil. 2006. Speech Communication

“…Voice conversion (VC) is a technique used to modify paralinguistic factors of an utterance from a source speaker to sound like a target speaker. Para-linguistic factors include speaker identity [1], prosody [2] and accent [3], etc. In this paper, we focus on the conversion of speaker identity across arbitrary speakers under a one-shot scenario [4,5], i.e., given only one target speaker's utterance for reference.…”

Section: Introduction (mentioning)

confidence: 99%

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

Wang, Deng, Yeung et al. 2021. Preprint

One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally ignores the correlation between different speech representations during training, which causes leakage of content information into the speaker representation and thus degrades VC performance. To alleviate this issue, we employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training, to achieve proper disentanglement of content, speaker and pitch representations, by reducing their inter-dependencies in an unsupervised manner. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations for retaining source linguistic content and intonation variations, while capturing target speaker characteristics. In doing so, the proposed approach achieves higher speech naturalness and speaker similarity than current state-of-the-art one-shot VC systems. Our code, pre-trained models and demo are available at https://github.com/Wendison/VQMIVC.

“…Voice conversion (VC) is a task aimed at converting the speech signals from a certain acoustic domain to another while keeping the linguistic content the same. Examples of acoustic domains include not only speaker identity [1,2,3,4], but many other factors orthogonal to the linguistic content, such as speaking style, speaking rate [5], noise condition, emotion [6,7], and accent [8], with potential applications ranging from speech enhancement [9,10], computer-assisted pronunciation training for non-native language learner [8], speaking assistance [11], to name a few. This paper focuses on using VC to improve the speech intelligibility of surgical patients who have had parts of their articulators removed.…”

Section: Introduction (mentioning)

confidence: 99%

Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech

2019

This paper focuses on using voice conversion (VC) to improve the speech intelligibility of surgical patients who have had parts of their articulators removed. Due to the difficulty of data collection, VC without parallel data is highly desired. Although techniques for unparallel VC, such as CycleGAN, have been developed, they usually focus on transforming the speaker identity by directly transforming the speech of one speaker to that of another, and as such do not address the task here. In this paper, we propose a new approach for unparallel VC. The proposed approach transforms impaired speech to normal speech while preserving the linguistic content and speaker characteristics. To our knowledge, this is the first end-to-end GAN-based unsupervised VC model applied to impaired speech. The experimental results show that the proposed approach outperforms CycleGAN.
