Novel Pre-processing using Outlier Removal in Voice Conversion

Rao, Sushant V.; Shah, Nirmesh J.; Patil, Hemant A.

doi:10.21437/ssw.2016-22

Cited by 8 publications

(2 citation statements)

References 21 publications

“…Recently, equalizing formant locations using Dynamic Frequency Warping (DFW) was proposed to tackle these issue [6]. In addition, some of the approaches proposed to filter out such pairs from the training [7,8]. However, loosing number of pairs will not be useful in the case where the amount of training data is small.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Vocal Tract Length Warped Posterior Features for Non-Parallel Voice Conversion

2018

In non-parallel Voice Conversion (VC) with the Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA) algorithm, the occurrence of one-to-many and many-to-one pairs in the training data deteriorates the performance of the stand-alone VC system. Handling these pairs during training is less explored. In this paper, we establish the relationship via an intermediate speaker-independent posteriorgram representation, instead of directly mapping the source spectrum to the target spectrum. To that effect, one Deep Neural Network (DNN) maps the source spectrum to the posteriorgram representation, and another DNN maps this posteriorgram representation to the target speaker's spectrum. We propose to use unsupervised Vocal Tract Length Normalization (VTLN)-based warped Gaussian posteriorgram features as the speaker-independent representation. We performed experiments on a small subset of the publicly available Voice Conversion Challenge (VCC) 2016 database. The proposed approach yields lower Mel Cepstral Distortion (MCD) values than both the baseline and the supervised phonetic posteriorgram feature-based speaker-independent representation. Furthermore, subjective evaluation gave a relative improvement of 13.3% with the proposed approach in terms of Speaker Similarity (SS).
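The abstract above reports results as Mel Cepstral Distortion (MCD). As a rough illustration only, the following sketch computes MCD in dB using the commonly used definition (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. The function name, the assumption that the two sequences are already time-aligned, and the choice to skip the 0th (energy) coefficient are assumptions made here, not details taken from the paper.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    Each argument is a list of frames; each frame is a list of coefficients.
    Assumption: coefficient 0 (energy) is excluded, as is common practice.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("sequences must be non-empty and of equal length")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        # Sum of squared differences over coefficients 1..D.
        squared = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)
```

A lower value means the converted cepstra are closer to the reference cepstra; how frames are aligned before this measurement is left outside the sketch.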

“…Stand-alone VC techniques that are based on Gaussian Mixture Model (GMM) [2,3], frequency warping (FW) [4,5], exemplar [6] and Deep Neural Network (DNN) [7][8][9] requires the aligned spectral features before learning the mapping function. In the VC literature, it has been shown that the alignment accuracy clearly affects the quality of converted speech signal [10][11][12]. Hence, the accurate aligned spectral features from both the source and the target speakers' training speech database are required.…”

Section: Introduction (mentioning)

confidence: 99%

Effectiveness of Dynamic Features in INCA and Temporal Context-INCA

Shah

Patil

2018

Interspeech 2018

Non-parallel Voice Conversion (VC) has gained significant attention over the last decade. Obtaining corresponding speech frames from both the source and target speakers before learning the mapping function is a key step in the stand-alone non-parallel VC task. Obtaining such corresponding pairs is more challenging because the two speakers may have spoken different utterances, in the same or in different languages. Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA) and its variant Temporal Context (TC)-INCA are popular unsupervised alignment algorithms. INCA and TC-INCA iteratively learn the mapping function after obtaining the Nearest Neighbor (NN) aligned pairs from the intermediate converted and the target spectral features. In this paper, we propose to use dynamic features along with static features to calculate the NN aligned pairs in both the INCA and TC-INCA algorithms, since dynamic features are known to play a key role in differentiating major phonetic categories. We obtained average relative improvements of 13.75% and 5.39% with our proposed Dynamic INCA and Dynamic TC-INCA, respectively. This improvement is also positively reflected in the quality of the converted voices.
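The nearest-neighbor step described in this abstract can be sketched as follows. This is a minimal illustration of pairing frames on static plus dynamic (delta) features, not the authors' implementation: the two-frame central-difference delta, the Euclidean distance, and the names `add_deltas` and `nearest_neighbor_pairs` are all assumptions, and the iterative conversion step of INCA is omitted.

```python
def add_deltas(frames):
    """Append a simple delta, 0.5 * (x[t+1] - x[t-1]), to every frame.

    Assumption: edge frames reuse themselves as the missing neighbor.
    """
    n = len(frames)
    augmented = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        delta = [0.5 * (b - a) for a, b in zip(prev_frame, next_frame)]
        augmented.append(list(frames[t]) + delta)
    return augmented


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_pairs(source, target):
    """Pair each source frame index with its closest target frame index."""
    pairs = []
    for i, src in enumerate(source):
        j = min(range(len(target)),
                key=lambda k: squared_distance(src, target[k]))
        pairs.append((i, j))
    return pairs


# Usage: augment both sequences with deltas, then search for neighbors.
source_static = [[0.0], [1.0], [2.0]]
target_static = [[2.1], [0.9], [0.1]]
aligned = nearest_neighbor_pairs(add_deltas(source_static),
                                 add_deltas(target_static))
```

Because the search runs over the concatenated static and delta vectors, two frames with similar spectra but opposite local trajectories are less likely to be paired than with static features alone.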

Analysis of Features and Metrics for Alignment in Text-Dependent Voice Conversion

Shah

Patil

2017

Lecture Notes in Computer Science
