This paper derives a speech parameter generation algorithm for HMM-based speech synthesis, in which a speech parameter sequence is generated from HMMs whose observation vectors consist of a spectral parameter vector and its dynamic feature vectors. In the algorithm, we assume that the state sequence (the state and mixture sequence in the multi-mixture case), or a part of it, is unobservable (i.e., hidden or latent). As a result, the algorithm iterates between the forward-backward algorithm and the parameter generation algorithm for the case where the state sequence is given. Experimental results show that the algorithm reproduces a clearer formant structure from multi-mixture HMMs than from single-mixture HMMs.
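The inner step mentioned above, parameter generation when the state sequence is given, can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a scalar static feature, a single delta window 0.5 * (c[t+1] - c[t-1]) with zero padding at the boundaries, and diagonal variances, and the names `build_window_matrix`, `solve`, and `generate_parameters` are hypothetical. With the observation written as o = W c, the most likely static sequence solves (W^T P W) c = W^T P mu, where P holds the inverse variances.

```python
def build_window_matrix(T):
    """Stack static and delta windows into a 2T x T matrix W so that o = W c."""
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0                    # static row
        if t > 0:
            W[2 * t + 1][t - 1] = -0.5       # delta row: 0.5 * (c[t+1] - c[t-1])
        if t < T - 1:
            W[2 * t + 1][t + 1] = 0.5        # frames outside the sequence count as zero
    return W


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def generate_parameters(means, variances):
    """means, variances: one (static, delta) pair per frame, taken from the
    Gaussian of the state assigned to that frame (assumed already known)."""
    T = len(means)
    W = build_window_matrix(T)
    mu = [m for pair in means for m in pair]
    prec = [1.0 / v for pair in variances for v in pair]
    rows = range(2 * T)
    # Normal equations: A = W^T P W, b = W^T P mu
    A = [[sum(W[r][i] * prec[r] * W[r][j] for r in rows) for j in range(T)]
         for i in range(T)]
    b = [sum(W[r][i] * prec[r] * mu[r] for r in rows) for i in range(T)]
    return solve(A, b)


# A step in the static means with zero delta means yields a smoothed trajectory.
step_means = [(0.0, 0.0)] * 3 + [(1.0, 0.0)] * 3
unit_vars = [(1.0, 1.0)] * 6
trajectory = generate_parameters(step_means, unit_vars)
```

Without the delta rows the solution would simply copy the static means frame by frame; the dynamic-feature constraint is what produces a smooth trajectory across state boundaries.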
In this paper, we analyze the effects of several factors and configuration choices encountered during training and model construction, with the aim of obtaining better and more stable adaptation in HMM-based speech synthesis. We then propose a new adaptation algorithm called constrained structural maximum a posteriori linear regression (CSMAPLR), whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms. We investigate six major aspects of speaker adaptation: initial models; the amount of training data for the initial models; the transform functions, estimation criteria, and sensitivity of several linear regression adaptation algorithms; and combination algorithms. To analyze the effect of the initial model, we compare speaker-dependent models, gender-independent models, and the simultaneous use of the gender-dependent models with the single use of the gender-dependent models. To analyze the effect of the transform functions, we compare a transform function for mean vectors only with one for both mean vectors and covariance matrices. To analyze the effect of the estimation criteria, we compare the ML criterion with a robust estimation criterion called structural MAP. We evaluate the sensitivity of several thresholds for the piecewise linear regression algorithms and examine methods that combine MAP adaptation with the linear regression algorithms. We incorporate these adaptation algorithms into our speech synthesis system and present several subjective and objective evaluation results showing the utility and effectiveness of these algorithms in speaker adaptation for HMM-based speech synthesis.
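The difference between the two kinds of transform function compared above can be sketched as follows. This is an illustration under assumptions, not the paper's formulation: it assumes the usual model-space forms, in which a mean-only transform maps mu to A mu + b and leaves the covariance untouched, while a constrained transform applies the same matrix to both (mu to A mu + b, Sigma to A Sigma A^T). The function names are hypothetical, and estimating A and b (by ML or structural MAP) is not shown.

```python
def matvec(A, x):
    """Matrix-vector product for nested lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def adapt_mean_only(mean, cov, A, b):
    """Mean-only linear transform: the covariance is returned unchanged."""
    new_mean = [m + bi for m, bi in zip(matvec(A, mean), b)]
    return new_mean, cov


def adapt_constrained(mean, cov, A, b):
    """Constrained transform: the same matrix A adapts mean and covariance."""
    new_mean = [m + bi for m, bi in zip(matvec(A, mean), b)]
    new_cov = matmul(matmul(A, cov), transpose(A))
    return new_mean, new_cov


# Illustrative values only.
A = [[2.0, 0.0], [0.0, 1.0]]
b = [1.0, 0.0]
mean = [1.0, 1.0]
cov = [[1.0, 0.0], [0.0, 1.0]]
mean_only_result = adapt_mean_only(mean, cov, A, b)
constrained_result = adapt_constrained(mean, cov, A, b)
```

Both calls move the mean identically; only the constrained variant also rescales the variance along the first dimension.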
ABSTRACT
This paper describes a mel-cepstral analysis method and its adaptive algorithm. In the proposed method, we apply the criterion used in the unbiased estimation of log spectrum to the spectral model represented by the mel-cepstral coefficients. To solve the non-linear minimization problem involved in the method, we give an iterative algorithm whose convergence is guaranteed. Furthermore, we derive an adaptive algorithm for the mel-cepstral analysis by introducing an instantaneous estimate for the gradient of the criterion. The adaptive mel-cepstral analysis system is implemented with an IIR adaptive filter which has an exponential transfer function, and whose stability is guaranteed. We also present examples of speech analysis and results of an isolated word recognition experiment.
INTRODUCTION
The spectrum represented by the mel-cepstral coefficients has a frequency resolution similar to that of the human ear, which has high resolution at low frequencies [1]. As a result, mel-cepstral coefficients are useful for speech synthesis and recognition. Several methods have been proposed for obtaining mel-cepstral coefficients. For example, the mel-cepstral coefficients can be obtained from the LPC coefficients by using the technique of spectral resampling. No strict method, however, has been proposed in which the spectral model is represented by mel-cepstral coefficients and a spectral criterion is minimized.
In this paper, we propose a mel-cepstral analysis method and its adaptive algorithm. In the mel-cepstral analysis method, the model spectrum is represented by the M-th order mel-cepstral coefficients, and the criterion used in the unbiased estimation of log spectrum [2] is minimized with respect to the mel-cepstral coefficients. The minimization problem is solved efficiently by an iterative technique using the FFT, recursion formulas, and a fast algorithm that requires O(M^2) arithmetic operations. We can show that the convergence is quadratic and that typically a few iterations are sufficient to obtain the solution.
Furthermore, we present an adaptive algorithm for the mel-cepstral analysis. To derive the adaptive algorithm, we introduce an instantaneous estimate for the gradient of the criterion in a manner similar to the LMS algorithm [3].
The adaptive analysis system is implemented with an IIR adaptive filter which has the structure of the MLSA filter. We show examples of analysis for synthetic and speech signals. To evaluate the proposed methods, an isolated word recognition experiment was carried out.
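To make the idea of an instantaneous gradient estimate concrete, here is a minimal sketch of the standard LMS algorithm itself, which the adaptive mel-cepstral algorithm is said to resemble. It is not the paper's method: it adapts a plain FIR filter to a squared-error criterion, whereas the paper adapts mel-cepstral coefficients in an IIR filter under a different criterion. The function name `lms_identify`, the two-tap system, and the step size are illustrative assumptions.

```python
import random


def lms_identify(x, d, num_taps, step_size):
    """Adapt FIR weights w so that the filter output tracks the desired signal d."""
    w = [0.0] * num_taps
    buf = [0.0] * num_taps                    # most recent input sample first
    for x_n, d_n in zip(x, d):
        buf = [x_n] + buf[:-1]
        y_n = sum(wi * bi for wi, bi in zip(w, buf))
        e_n = d_n - y_n
        # The gradient of the expected squared error is replaced by its
        # instantaneous value -2 * e_n * buf; the factor 2 is absorbed
        # into step_size.
        w = [wi + step_size * e_n * bi for wi, bi in zip(w, buf)]
    return w


# Identify a hypothetical noise-free two-tap system from its input and output.
random.seed(0)
true_w = [0.5, -0.3]
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
d = [true_w[0] * x[n] + (true_w[1] * x[n - 1] if n > 0 else 0.0)
     for n in range(len(x))]
w_est = lms_identify(x, d, num_taps=2, step_size=0.05)
```

Each update uses only the current sample, so no averaging over frames is needed; this is the property that lets an analysis of this kind run sample by sample.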
SPECTRAL ESTIMATION BASED ON MEL-CEPSTRAL REPRESENTATION
Our concept of boron neutron capture therapy (BNCT) is selective destruction of tumor cells using the heavy-charged particles yielded through 10B(n,α)7Li reactions. To design a new protocol that employs epithermal neutron beams in the treatment of glioma patients, we examined the relationship between the radiation dose, histological tumor grade, and clinical outcome. Since 1968, 183 patients with different kinds of brain tumors were treated by BNCT; for this retrospective study, we selected 105 patients with glial tumors who were treated in Japan between 1978 and 1997. In the analysis of side effects due to radiation, we included all the 159 patients treated between 1977 and 2001. With respect to the radiation dose (i.e., physical dose of boron n-alpha reaction), the new protocol prescribes a minimum tumor volume dose of 15 Gy or, alternatively, a minimum target volume dose of 18 Gy. The maximum vascular dose should not exceed 15 Gy (physical dose of boron n-alpha reaction) and the total amount of gamma rays should remain below 10 Gy, including core gamma rays from the reactor and capture gamma in brain tissue. The outcomes for 10 patients who were treated by the new protocol using a new mode composed of thermal and epithermal neutrons are reported.