Takashi NOSE†a), Member, Yuhei OTA†b), Nonmember, and Takao KOBAYASHI†c), Member
SUMMARY    We propose a segment-based voice conversion technique using hidden Markov model (HMM)-based speech synthesis with nonparallel training data. In the proposed technique, phoneme information with durations and a quantized F0 contour are extracted from the input speech of a source speaker and transmitted to a synthesis part. In the synthesis part, the quantized F0 symbols are used as prosodic context: a phonetically and prosodically context-dependent label sequence is generated from the transmitted phonemes and F0 symbols, and converted speech is then generated from this label sequence, with durations, using the target speaker's pre-trained context-dependent HMMs. In the model training, the models of the source and target speakers can be trained separately, so there is no need to prepare parallel speech data of the two speakers. Objective and subjective experimental results show that segment-based voice conversion with phonetic and prosodic contexts works effectively even when parallel speech data are not available.
key words: voice conversion, F0 quantization, prosodic context, nonparallel data
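The F0-quantization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of levels, the use of a log-F0 scale, the per-utterance min/max range, and the symbol names (`q0`, `q1`, …, with `xx` for unvoiced frames) are all assumptions made for the example.

```python
import numpy as np

def quantize_f0(f0_hz, n_levels=8):
    """Quantize an F0 contour (Hz) into discrete prosodic symbols.

    Unvoiced frames (f0 <= 0) map to a dedicated 'xx' symbol; voiced
    frames are linearly quantized on the log-F0 scale between the
    utterance's minimum and maximum. All naming and the default
    n_levels are illustrative choices, not the paper's settings.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0
    symbols = np.full(f0_hz.shape, "xx", dtype=object)
    if voiced.any():
        log_f0 = np.log(f0_hz[voiced])
        lo, hi = log_f0.min(), log_f0.max()
        span = hi - lo if hi > lo else 1.0  # guard a flat contour
        levels = np.minimum((n_levels * (log_f0 - lo) / span).astype(int),
                            n_levels - 1)
        symbols[voiced] = [f"q{lv}" for lv in levels]
    return symbols.tolist()

# Example: a rising contour with one unvoiced frame in the middle.
print(quantize_f0([100, 0, 120, 150, 200], n_levels=4))
# → ['q0', 'xx', 'q1', 'q2', 'q3']
```

The resulting symbol sequence can then be attached to the phoneme labels as additional (prosodic) context when building the context-dependent label sequence.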
Introduction

Recent developments in statistical parametric speech processing have provided many useful applications in speech recognition and speech synthesis. Voice conversion is one such attractive application: it can change nonlinguistic or paralinguistic information, e.g., speaker individuality or emotional expression, appearing in speech. The demand for voice conversion applications is increasing in many fields, such as entertainment [1], foreign language education [2], and software for the physically challenged [3].

In this context, a variety of techniques have been proposed [4]. The most widely studied techniques are based on statistical mapping of spectral features at the frame level using a probabilistic model, i.e., a Gaussian mixture model (GMM) [5], [6]. Although a source speaker's spectral features can easily be converted to be closer to those of a target speaker using the GMM-based framework, several problems remain, such as the requirement of parallel data, the over-smoothing effect, and insufficient prosody conversion. Recently, several approaches have been proposed to overcome these problems. In [7], spectral mapping with nonparallel training data was achieved by introducing hidden Markov model (HMM)-based modeling and adaptation with phonetic information. The over-smoothing effect is alleviated by introducing global variance (GV) parameters into the estimation of the parameter trajectory [8]. For prosody conversion, nonlinear modification of the fundamental frequency (F0) has been proposed based on a multi-space distribution GMM (MSD-GMM) [9]. However, in the above techniques, it is not easy to appropriately convert segmental or supra-segmental speaker characteristics, because no phonetic or prosodic context is taken into account in the model ...
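For concreteness, the frame-level GMM mapping that the paper contrasts against computes the conditional expectation of a target feature given a source feature under a joint GMM. The toy sketch below uses a one-dimensional, two-component model with hand-set parameters; a real system would estimate the weights, means, and covariances from time-aligned parallel frames of the two speakers.

```python
import numpy as np

# Hand-set joint GMM over [x; y] (source, target features).
# These numbers are invented for illustration only.
weights = np.array([0.5, 0.5])     # component priors
means = np.array([[0.0, 1.0],      # [mu_x, mu_y] for component 0
                  [4.0, 6.0]])     # [mu_x, mu_y] for component 1
var_xx = np.array([1.0, 1.0])      # source variance per component
cov_yx = np.array([0.8, 0.8])      # cross-covariance per component

def convert(x):
    """Map a scalar source feature x to E[y | x] under the joint GMM."""
    # Posterior p(i | x) from the x-marginal of each component.
    px = weights * np.exp(-0.5 * (x - means[:, 0]) ** 2 / var_xx) \
         / np.sqrt(2.0 * np.pi * var_xx)
    post = px / px.sum()
    # Per-component conditional mean E[y | x, i] (linear regression form).
    cond = means[:, 1] + cov_yx / var_xx * (x - means[:, 0])
    # Posterior-weighted sum over components.
    return float(post @ cond)

# A source frame near component 0 maps close to that component's
# target mean; one near component 1 maps close to the other's.
print(convert(0.0), convert(4.0))
```

This frame-independent weighted regression is exactly what causes the over-smoothing and parallel-data requirements mentioned above, which motivates the segment-based, context-dependent alternative proposed in this paper.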