HSMM-Based Model Adaptation Algorithms for Average-Voice-Based Speech Synthesis

Yamagishi, Junichi; Ogata, K.; Nakano, Y.; Isogai, J.; Kobayashi, Takao

doi:10.1109/icassp.2006.1659961

Cited by 18 publications

(11 citation statements)

References 10 publications

“…The procedure of warping factor generation by GMM-based method is illustrated in Figure 3. $\hat{\mu} = A\mu + B$ (3) where $A$ is the transform matrix to estimate, $B$ is the bias vector, $\mu$ is original model mean and $\hat{\mu}$ is the adapted mean. The transform matrix $A$ is estimated by maximizing likelihood of adaptation data $O$ from target speaker,…”

Section: Frequency Warping For Speaker Adaptation Of Speech Synth (mentioning)

confidence: 99%
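
The citation statement above describes an affine transform of model means. As a minimal sketch, assuming the usual MLLR form (adapted mean = A · original mean + B), applying an already-estimated transform could look like the following. The function name `adapt_mean` and all numbers are illustrative assumptions; estimating A and B by maximizing the likelihood of the adaptation data is not shown.

```python
def adapt_mean(A, B, mu):
    """Apply an affine mean transform: returns A @ mu + B.

    A is a matrix given as a list of rows, B is the bias vector,
    and mu is the original model mean (all hypothetical inputs).
    """
    if len(A) != len(B) or any(len(row) != len(mu) for row in A):
        raise ValueError("shape mismatch between A, B and mu")
    return [
        sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
        for row, b_i in zip(A, B)
    ]


# Made-up 2-dimensional example, not taken from any cited paper.
A = [[1.0, 0.5],
     [0.0, 2.0]]
B = [0.1, -0.2]
mu = [2.0, 3.0]
adapted = adapt_mean(A, B, mu)
```

In a real system the same transform is typically shared by all Gaussians in a regression class, so one estimated pair (A, B) adapts many means at once.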

“…Voice transformation is generally used to convert the speech synthesized by unit selection based waveform concatenation TTS system. Speaker adaption adjusts the parameters of Maximum a posterior (MAP), maximum likelihood linear regression (MLLR) and speaker adaptive training (SAT), which are originally developed for automatic speech recognition (ASR), are applied to change the voice characteristics of the speech generated by statistical parametric speech synthesis systems [3,4].…”

Section: Introduction (mentioning)

confidence: 99%

Frequency warping for speaker adaption of text-to-speech synthesis

Gao; Cao (2010)

IET 3rd International Conference on Wireless, Mobile and Multimedia Networks (ICWMMN 2010)

Vocal tract length normalization (VTLN) is generally used in speech recognition for removing individual speaker characteristics. In this paper, we apply VTLN to speaker adaptation of speech synthesis. We propose a new frequency warping approach to reduce the spectrum distance between source and target speakers. The frequency warping function is based on a bilinear function and the warping factor is dynamically generated frame-by-frame. The warped spectra of source speaker are then converted to LSPs to train hidden Markov models (HMM). HMMs are further adapted by maximum likelihood linear regression (MLLR) with target […] warping approach can make the warped spectra of source speaker closer to target speaker and the resultant adapted HMMs have a better performance than the HMMs trained with unwarped spectra in terms of voice naturalness and speaker similarity.
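
The abstract names a bilinear warping function with a per-frame warping factor but does not give the formula. As a hedged sketch, the following assumes the first-order all-pass form commonly used for bilinear frequency warping, with `alpha` as the warping factor; the function name and the per-frame values are illustrative assumptions, not the authors' method for generating the factor.

```python
import math


def bilinear_warp(omega, alpha):
    """Warp a normalized angular frequency omega (0..pi) with factor alpha.

    Assumed form (first-order all-pass phase response):
        omega + 2 * atan(alpha * sin(omega) / (1 - alpha * cos(omega)))
    alpha = 0 leaves the frequency unchanged; |alpha| must be below 1.
    """
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy |alpha| < 1")
    # The denominator is always positive for |alpha| < 1, so atan2 matches atan.
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


# Hypothetical per-frame warping factors applied to one frequency bin.
frame_alphas = [0.0, 0.1, 0.2]
omega = math.pi / 2
warped = [bilinear_warp(omega, a) for a in frame_alphas]
```

The endpoints 0 and π map to themselves and the curve is monotonic for |alpha| < 1, which is why this family is a convenient choice for stretching or compressing the spectrum without reordering frequencies.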

“…Recently, voice conversion in the framework of hidden Markov model based speech synthesis has also become a popular topic (e.g. [5]). …”

Section: Introduction (mentioning)

confidence: 99%

LSF mapping for voice conversion with very small training sets

Helander; Nurminen; Gabbouj (2008)

2008 IEEE International Conference on Acoustics, Speech and Signal Processing

To make voice conversion usable in practical applications, the number of training sentences should be minimized. With traditional Gaussian mixture model (GMM) based techniques small training sets lead to over-fitting and estimation problems. We propose a new approach for mapping line spectral frequencies (LSFs) representing the vocal tract. The idea is based on inherent intra-frame correlations of LSFs. For each target LSF, a separate GMM is used and only the source and target LSF elements best correlating with the current LSF are used in training. The proposed method is evaluated both objectively and in listening tests, and it is shown that the method outperforms the conventional GMM approach especially with very small training sets.
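
The selection step described in this abstract (keeping only the LSF elements that correlate best with the current LSF) can be illustrated with a small sketch. It assumes Pearson correlation over frames and a fixed number `k` of retained elements; the helper names `pearson` and `top_correlated` and the toy data are hypothetical, and the per-LSF GMM training itself is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_correlated(candidates, target, k):
    """Return sorted indices of the k candidate tracks with the largest |correlation|.

    candidates: list of per-element tracks, each a list of per-frame values.
    target: per-frame values of the LSF currently being modeled.
    """
    scores = [abs(pearson(track, target)) for track in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: four frames, three candidate LSF tracks (made-up values).
target_lsf = [0.1, 0.2, 0.3, 0.4]
candidate_tracks = [
    [1.0, 2.0, 3.0, 4.0],    # rises with the target
    [1.0, -1.0, 1.0, -1.0],  # weakly related
    [4.0, 3.0, 2.0, 1.5],    # falls as the target rises
]
selected = top_correlated(candidate_tracks, target_lsf, k=2)
```

Restricting each model to a few well-correlated inputs keeps the number of parameters small, which is the property the abstract relies on for very small training sets.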

“…Several adaptation algorithms have been borrowed from speech recognition and further developed [10] for HMM-based speech synthesis. Since the purpose of speaker adaptation for speech synthesis is different from that for speech recognition, a speech synthesis-specific adaptation algorithm, called Minimum Generation Error Linear Regression (MGELR), has also been proposed [13].…”

Section: From Intra-lingual To Cross (mentioning)

confidence: 99%

“…This is achieved by modifying the HMM parameters using model adaptation technique. Several model adaptation algorithms, which were originally proposed for speech recognition, including Maximum a Posteriori (MAP), Maximum Likelihood Linear Regression (MLLR) [7], Constrained MLLR (CMLLR) [8], and so on, have been applied to HMM-based speech synthesis [9,10]. It has been demonstrated that speaker adaptation of an "Average Voice" model [11] is superior to speaker adaptation of a speaker-dependent model.…”

Section: Introduction (mentioning)

confidence: 99%

Cross-Lingual Speaker Adaptation for HMM-Based Speech Synthesis

King; Tokuda (2008)

2008 6th International Symposium on Chinese Spoken Language Processing

This paper explores a cross-lingual speaker adaptation technique for HMM-based speech synthesis, where a source voice model for English is transformed into a target speaker model using Mandarin Chinese speech data from the target speaker. A phone mapping-based method is adopted to map Chinese Initial/Finals into English phonemes and two types of mapping rules, including one-to-one and one-to-sequence mappings, are compared. In order to avoid having to map prosodic features between languages, the adaptation procedure uses regression classes and transforms that are constructed for triphone models, then used to adapt the phonetic-and-prosodic-context-dependent models. From the experimental results, we found that a one-to-sequence phone mapping is better than a one-to-one mapping, and that the similarity between adapted English speech and target Chinese speaker is reasonable.
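
The difference between the two mapping rules compared in this abstract can be shown with a toy lookup. The tables below are purely hypothetical (the unit labels and phoneme symbols are invented for illustration and are not the mapping used in the paper); the sketch only shows that a one-to-sequence rule may expand one Mandarin Initial/Final into several English phonemes.

```python
# Hypothetical mapping tables: each Mandarin unit maps to a list of English phonemes.
ONE_TO_ONE = {
    "b": ["b"],
    "m": ["m"],
    "ai": ["ay"],
    "iao": ["aw"],       # forced to a single closest phoneme
}
ONE_TO_SEQUENCE = {
    "b": ["b"],
    "m": ["m"],
    "ai": ["ay"],
    "iao": ["y", "aw"],  # one unit expands to a phoneme sequence
}


def map_units(units, table):
    """Map a list of Mandarin Initial/Final labels to English phonemes via table."""
    phonemes = []
    for unit in units:
        if unit not in table:
            raise KeyError(f"no mapping for unit {unit!r}")
        phonemes.extend(table[unit])
    return phonemes


syllable = ["b", "iao"]
mapped_one_to_one = map_units(syllable, ONE_TO_ONE)
mapped_one_to_sequence = map_units(syllable, ONE_TO_SEQUENCE)
```

With the one-to-sequence table the output can be longer than the input, so any alignment between Mandarin adaptation data and English models has to allow for that expansion.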
