Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training

Yamagishi, Junichi; Kobayashi, Takao

doi:10.1093/ietisy/e90-d.2.533

Cited by 142 publications

(119 citation statements)

References 0 publications

Supporting

Mentioning: 115

Contrasting

“…This section aims to explain the predominant statistical modeling approach applied in speech synthesis, i.e., context-dependent multi-space probability distribution left-to-right without skip transitions HSMM [3,14] (simply called HSMM in the remainder of this paper). The discussion presented in this section provides a preliminary framework which will be used as a basis to introduce the proposed HMEM technique in Section 3.…”

Section: HSMM-based Speech Synthesis (mentioning)

confidence: 99%

“…Also, α_t(i) and β_t(i) are partial forward and backward probability variables that are calculated successively from their previous or next values as follows [3,14]:…”

Section: HSMM Likelihood (mentioning)

confidence: 99%

“…It is mainly due to SPSS advantages over traditional concatenative speech synthesis approaches; these advantages include the flexibility to change voice characteristics [3][4][5], multilingual support [6][7][8], coverage of acoustic space [1], small footprint [1], and robustness [4,9]. All of the above advantages stem from the fact that SPSS provides a statistical model for acoustic features instead of using original speech waveforms.…”

Section: Introduction (mentioning)

confidence: 99%

“…This latter method exploits an invaluable prior knowledge attained from an average voice model [3], and adapts this general model using an adaptation algorithm such as maximum likelihood linear regression (MLLR) [32], maximum a posteriori (MAP) [33], and cluster adaptive training (CAT) [21]. However, working with average voice models is difficult for under-resourced languages since building such general model needs remarkable efforts to design, record, and transcribe a thorough multi-speaker speech database [3]. To alleviate the data sparsity problem in under-resourced languages, speaker and language factorization (SLF) technique can be used [34].…”

Section: Introduction (mentioning)

confidence: 99%

Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis

Khorram, Sameti, Bahmaninezhad, et al. 2014

J AUDIO SPEECH MUSIC PROC.

Decision tree-clustered context-dependent hidden semi-Markov models (HSMMs) are typically used in statistical parametric speech synthesis to represent probability densities of acoustic features given contextual factors. This paper addresses three major limitations of this decision tree-based structure: (i) The decision tree structure lacks adequate context generalization. (ii) It is unable to express complex context dependencies. (iii) Parameters generated from this structure represent sudden transitions between adjacent states. In order to alleviate the above limitations, many former papers applied multiple decision trees with an additive assumption over those trees. Similarly, the current study uses multiple decision trees as well, but instead of the additive assumption, it is proposed to train the smoothest distribution by maximizing entropy measure. Obviously, increasing the smoothness of the distribution improves the context generalization. The proposed model, named hidden maximum entropy model (HMEM), estimates a distribution that maximizes entropy subject to multiple moment-based constraints. Due to the simultaneous use of multiple decision trees and maximum entropy measure, the three aforementioned issues are considerably alleviated. Relying on HMEM, a novel speech synthesis system has been developed with maximum likelihood (ML) parameter re-estimation as well as maximum output probability parameter generation. Additionally, an effective and fast algorithm that builds multiple decision trees in parallel is devised. Two sets of experiments have been conducted to evaluate the performance of the proposed system. In the first set of experiments, HMEM with some heuristic context clusters is implemented. This system outperformed the decision tree structure in small training databases (i.e., 50, 100, and 200 sentences). In the second set of experiments, the HMEM performance with four parallel decision trees is investigated using both subjective and objective tests. All evaluation results of the second experiment confirm significant improvement of the proposed system over the conventional HSMM.

“…The mean vectors and covariance matrices of state output distributions of the target speaker's model are obtained by linearly transforming the mean vectors and covariance matrices of state output distributions of the source speaker's model [16]. The same idea lies for CMLLR.…”

Section: Conception Of the Speech Synthesizers (mentioning)

confidence: 99%
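
The linear transformation described in the citation statement above can be illustrated with a small sketch. It assumes a single affine transform (matrix A, bias b) shared by the mean and the covariance, as in a constrained, CMLLR-style setup; the excerpt does not show the exact formulation of [16], so the function name `adapt_gaussian` and all values below are hypothetical and chosen only for illustration.

```python
# Illustrative sketch (assumption): adapt one Gaussian state output
# distribution from a source model to a target speaker with a shared
# affine transform: mean' = A*mean + b, cov' = A*cov*A^T.

def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    """Return the transpose of matrix A."""
    return [list(col) for col in zip(*A)]

def adapt_gaussian(mean, cov, A, b):
    """Apply the affine transform to a mean vector and covariance matrix."""
    new_mean = [m + bi for m, bi in zip(matvec(A, mean), b)]
    new_cov = matmul(matmul(A, cov), transpose(A))
    return new_mean, new_cov

# Hypothetical 2-dimensional example values.
mean = [1.0, 2.0]
cov = [[1.0, 0.0], [0.0, 4.0]]
A = [[2.0, 0.0], [0.0, 0.5]]
b = [0.5, -1.0]

new_mean, new_cov = adapt_gaussian(mean, cov, A, b)
print(new_mean)  # [2.5, 0.0]
print(new_cov)   # [[4.0, 0.0], [0.0, 1.0]]
```

In practice the transform would be estimated from the target speaker's adaptation data and applied to every state output distribution in a regression class; the sketch only shows how one distribution changes once A and b are given.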