HMM-Based Voice Conversion Using Quantized F0 Context

Nose, Takeru; Ota, Yuhei; Kobayashi, Takao

doi:10.1587/transinf.e93.d.2483

Cited by 11 publications

(2 citation statements)

References 17 publications


“…One typical method is VocaListener [25] in singing voice synthesis, which is capable of automatically optimizing input manual parameters of a singing voice synthesizer using singing voices sung by a user. The VC method using HMM for speaking voices [26] may also be used for this purpose.…”

Section: Speech and Text Input (mentioning)

confidence: 99%

Augmented speech production based on real-time statistical voice conversion

Toda

2014

2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP)


In human-to-human speech communication, various barriers are caused by some constraints, such as physical constraints causing vocal disorders and environmental constraints making it hard to produce intelligible speech. These barriers would be overcome if our speech production was augmented so that we could produce speech sounds as we want beyond these constraints. Voice conversion (VC) is a technique for modifying speech acoustics, converting non-/para-linguistic information to any form we want while preserving the linguistic content. One of the most popular approaches to VC is based on statistical processing, which is capable of extracting a complex conversion function in a data-driven manner. Although this technique was originally studied in the context of speaker conversion, which converts the voice of a certain speaker to sound like that of another specific speaker, it has great potential to achieve various applications beyond speaker conversion. This paper briefly reviews a trajectory-based conversion method that is capable of effectively reproducing natural speech parameter trajectories utterance by utterance and highlights several techniques that extend this trajectory-based conversion method to achieve real-time conversion processing. Finally, this paper shows some examples of real-time VC applications to enhance human-to-human speech communication, such as speaking-aid, silent speech communication, and voice changer/vocal effector.

“…To improve the accuracy, the authors introduce phone-duration prediction using random forests [6] which is a kind of ensemble training [7]. Finally, speech parameter generation with mora-based emphasis context is presented to preserve rich intonation of natural speech, which is a variation of quantized fundamental frequency (F0) context [8] used also in voice conversion [9] and very low bit-rate speech coding [10].…”

Section: Introduction (mentioning)

confidence: 99%

Prosodically Rich Speech Synthesis Interface Using Limited Data of Celebrity Voice

Nose, Kamei

2016

JCC

Self Cite


To enhance communication between humans and robots at home in the future, speech synthesis interfaces that can generate expressive speech are indispensable. In addition, synthesizing celebrity voices is commercially important. To address these issues, this paper proposes techniques for synthesizing natural-sounding speech that has a rich prosodic personality using a limited amount of data in a text-to-speech (TTS) system. As a target speaker, we chose a well-known prime minister of Japan, Shinzo Abe, who has a good prosodic personality in his speeches. To synthesize natural-sounding and prosodically rich speech, accurate phrasing, robust duration prediction, and rich intonation modeling are important. For these purposes, we propose pause position prediction based on conditional random fields (CRFs), phone-duration prediction using random forests, and mora-based emphasis context labeling. We examine the effectiveness of the above techniques through objective and subjective evaluations.
