Takashi NOSE†a), Member, Yuhei OTA†b), Nonmember, and Takao KOBAYASHI†c), Member
SUMMARY    We propose a segment-based voice conversion technique using hidden Markov model (HMM)-based speech synthesis with nonparallel training data. In the proposed technique, phoneme information with durations and a quantized F0 contour are extracted from the input speech of a source speaker and transmitted to a synthesis part. In the synthesis part, the quantized F0 symbols are used as prosodic context: a phonetically and prosodically context-dependent label sequence is generated from the transmitted phonemes and F0 symbols, and converted speech is then generated from this label sequence, with durations, using the target speaker's pre-trained context-dependent HMMs. In the model training, the models of the source and target speakers can be trained separately, so there is no need to prepare parallel speech data of the two speakers. Objective and subjective experimental results show that segment-based voice conversion with phonetic and prosodic contexts works effectively even when parallel speech data are not available.
key words: voice conversion, F0 quantization, prosodic context, nonparallel data
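The F0-quantization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of levels, the use of a log-F0 scale, the per-utterance min/max range, and the symbol names (`q0`, `q1`, …, with `xx` for unvoiced frames) are all assumptions made for the example.

```python
import numpy as np

def quantize_f0(f0_hz, n_levels=8):
    """Quantize an F0 contour (Hz) into discrete prosodic symbols.

    Unvoiced frames (f0 <= 0) map to a dedicated 'xx' symbol; voiced
    frames are linearly quantized on the log-F0 scale between the
    utterance's minimum and maximum. All naming and the default
    n_levels are illustrative choices, not the paper's settings.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0
    symbols = np.full(f0_hz.shape, "xx", dtype=object)
    if voiced.any():
        log_f0 = np.log(f0_hz[voiced])
        lo, hi = log_f0.min(), log_f0.max()
        span = hi - lo if hi > lo else 1.0  # guard a flat contour
        levels = np.minimum((n_levels * (log_f0 - lo) / span).astype(int),
                            n_levels - 1)
        symbols[voiced] = [f"q{lv}" for lv in levels]
    return symbols.tolist()

# Example: a rising contour with one unvoiced frame in the middle.
print(quantize_f0([100, 0, 120, 150, 200], n_levels=4))
# → ['q0', 'xx', 'q1', 'q2', 'q3']
```

The resulting symbol sequence can then be attached to the phoneme labels as additional (prosodic) context when building the context-dependent label sequence.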
Introduction

Recent developments in statistical parametric speech processing have provided many useful applications in speech recognition and speech synthesis. Voice conversion is one such attractive application: it can change nonlinguistic or paralinguistic information, e.g., speaker individuality or emotional expression, appearing in speech. The demand for voice conversion applications is increasing in many fields, such as entertainment [1], foreign language education [2], and software for the physically challenged [3].

In this context, a variety of techniques have been proposed [4]. The most widely studied techniques are based on statistical mapping of spectral features at the frame level using a probabilistic model, i.e., a Gaussian mixture model (GMM) [5], [6]. Although a source speaker's spectral features can easily be converted to be closer to those of a target speaker using the GMM-based framework, several problems remain, such as the requirement of parallel data, the over-smoothing effect, and insufficient prosody conversion. Recently, several approaches have been proposed to overcome these problems. In [7], spectral mapping with nonparallel training data was achieved by introducing hidden Markov model (HMM)-based modeling and adaptation with phonetic information. The over-smoothing effect is alleviated by introducing global variance (GV) parameters into the estimation of the parameter trajectory [8]. For prosody conversion, nonlinear modification of the fundamental frequency (F0) has been proposed based on a multi-space distribution GMM (MSD-GMM) [9]. However, in the above techniques, it is not easy to appropriately convert segmental or supra-segmental speaker characteristics, because no phonetic or prosodic context is taken into account in the model ...
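For concreteness, the frame-level GMM mapping that the paper contrasts against computes the conditional expectation of a target feature given a source feature under a joint GMM. The toy sketch below uses a one-dimensional, two-component model with hand-set parameters; a real system would estimate the weights, means, and covariances from time-aligned parallel frames of the two speakers.

```python
import numpy as np

# Hand-set joint GMM over [x; y] (source, target features).
# These numbers are invented for illustration only.
weights = np.array([0.5, 0.5])     # component priors
means = np.array([[0.0, 1.0],      # [mu_x, mu_y] for component 0
                  [4.0, 6.0]])     # [mu_x, mu_y] for component 1
var_xx = np.array([1.0, 1.0])      # source variance per component
cov_yx = np.array([0.8, 0.8])      # cross-covariance per component

def convert(x):
    """Map a scalar source feature x to E[y | x] under the joint GMM."""
    # Posterior p(i | x) from the x-marginal of each component.
    px = weights * np.exp(-0.5 * (x - means[:, 0]) ** 2 / var_xx) \
         / np.sqrt(2.0 * np.pi * var_xx)
    post = px / px.sum()
    # Per-component conditional mean E[y | x, i] (linear regression form).
    cond = means[:, 1] + cov_yx / var_xx * (x - means[:, 0])
    # Posterior-weighted sum over components.
    return float(post @ cond)

# A source frame near component 0 maps close to that component's
# target mean; one near component 1 maps close to the other's.
print(convert(0.0), convert(4.0))
```

This frame-independent weighted regression is exactly what causes the over-smoothing and parallel-data requirements mentioned above, which motivates the segment-based, context-dependent alternative proposed in this paper.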