Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory

Toda, Tomoki; Black, Alan W.; Tokuda, Keiichi

doi:10.1109/tasl.2007.907344

Cited by 866 publications

(722 citation statements)

References 30 publications

Supporting

Mentioning

712

Contrasting

Unclassified


“…After applying a weighting matrix W [3] to an input speech parameter sequence x = [x_1, ..., x_T] for calculating its static-dynamic speech feature sequence, the DNNs predict a static-dynamic speech feature sequence of the converted speech. ŷ is generated from the static-dynamic features by using the maximum likelihood-based parameter generation algorithm [2]. We define the above speech parameter conversion as ŷ = G(x).…”

Section: Conventional DNN-based VC (mentioning)

confidence: 99%
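
The generation step described in the statement above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: it assumes one-dimensional features, a single delta window of 0.5 * (y[t+1] - y[t-1]), and a diagonal precision (inverse variance) that defaults to 1. All function names are hypothetical. Under these assumptions, the maximum-likelihood static trajectory ŷ solves (WᵀPW) ŷ = WᵀP m, where m is the predicted static-dynamic sequence.

```python
def build_window_matrix(T):
    """Return the 2T x T matrix W; row 2t is static, row 2t+1 is delta.

    Assumed delta window: 0.5 * (y[t+1] - y[t-1]); terms that would fall
    outside the sequence are simply dropped.
    """
    W = [[0.0] * T for _ in range(2 * T)]
    for t in range(T):
        W[2 * t][t] = 1.0
        if t > 0:
            W[2 * t + 1][t - 1] = -0.5
        if t < T - 1:
            W[2 * t + 1][t + 1] = 0.5
    return W


def transpose(A):
    return [list(col) for col in zip(*A)]


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def mlpg(mean, precision=None):
    """Recover the static trajectory from a static-dynamic mean sequence.

    mean: list of length 2T laid out like the rows of W.
    precision: optional list of 2T inverse variances (assumed diagonal).
    """
    T = len(mean) // 2
    W = build_window_matrix(T)
    if precision is None:
        precision = [1.0] * (2 * T)
    WtP = [[w * p for w, p in zip(row, precision)] for row in transpose(W)]
    return solve(matmul(WtP, W), matvec(WtP, mean))
```

If the predicted static-dynamic sequence is exactly consistent with some trajectory, that is m = W y, the routine returns y itself. When the static and delta predictions disagree, the solution is the precision-weighted least-squares compromise between them.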

“…Deep Neural Networks (DNNs) [1] have been used as acoustic models for VC because they can represent the relationship between the input and output speech parameters more accurately than conventional Gaussian mixture models [2]. These acoustic models are trained with training algorithms such as the maximum likelihood criterion [3] and Minimum Generation Error (MGE) criterion [4], [5].…”

Section: Introduction (mentioning)

confidence: 99%

“…The over-smoothing effect is an issue in not only VC but also other speech synthesis techniques, such as text-to-speech synthesis. Hence, several approaches have been devised to reproduce the characteristics of natural speech [2], [6], [7]. On the other hand, VC can utilize not only those approaches, but also input speech information since the input and output parameters are often in the same domain (e.g., cepstrum).…”

Section: Introduction (mentioning)

confidence: 99%


Voice Conversion Using Input-to-Output Highway Networks

Saito; Takamichi; Saruwatari. 2017. IEICE Trans. Inf. & Syst.


SUMMARY: This paper proposes Deep Neural Network (DNN)-based Voice Conversion (VC) using input-to-output highway networks. VC is a speech synthesis technique that converts input features into output speech parameters, and DNN-based acoustic models for VC are used to estimate the output speech parameters from the input speech parameters. Given that the input and output are often in the same domain (e.g., cepstrum) in VC, this paper proposes a VC using highway networks connected from the input to output. The acoustic models predict the weighted spectral differentials between the input and output spectral parameters. The architecture not only alleviates over-smoothing effects that degrade speech quality, but also effectively represents the characteristics of spectral parameters. The experimental results demonstrate that the proposed architecture outperforms feed-forward neural networks in terms of the speech quality and speaker individuality of the converted speech.

Key words: statistical parametric speech synthesis, DNN-based voice conversion, highway networks, over-smoothing
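
The idea of predicting a weighted spectral differential can be sketched as below. This is only an illustration under assumptions that the abstract does not spell out: the converted frame is taken to be ŷ = x + T(x) * G(x), where G(x) is a predicted differential and T(x) is a sigmoid gate in (0, 1), and both are reduced here to single affine layers. The function names and parameter layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Compute weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_convert(x, W_g, b_g, W_t, b_t):
    """Convert one frame x with an assumed input-to-output highway form.

    G(x): predicted spectral differential (single affine layer here).
    T(x): per-dimension gate in (0, 1) that weights the differential.
    """
    diff = affine(W_g, b_g, x)
    gate = [sigmoid(z) for z in affine(W_t, b_t, x)]
    return [xi + ti * di for xi, ti, di in zip(x, gate, diff)]


# Toy usage: identity differential and a gate fixed at 0.5.
frame = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
converted = highway_convert(frame, identity, [0.0, 0.0], zeros, [0.0, 0.0])
```

When the gate is driven towards zero, the input frame passes through almost unchanged, which is one way to read the abstract's claim that the input-to-output connection helps when input and output share a domain.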


“…Our proposed enhancement system uses a statistical F0 pattern prediction, which is a part of voice conversion techniques [10], [11], to predict F0 patterns of normal speech from spectral features of EL speech. It consists of training and prediction processes as shown in Fig.…”

Section: Statistical F0 Pattern Prediction (mentioning)

confidence: 99%

A Vibration Control Method of an Electrolarynx Based on Statistical F0 Pattern Prediction

Tanaka; Toda; Nakamura. 2017. IEICE Trans. Inf. & Syst.

Self Cite


SUMMARY: This paper presents a novel speaking aid system to help laryngectomees produce more natural-sounding electrolaryngeal (EL) speech. An electrolarynx is an external device that generates excitation signals in place of vocal fold vibration. Although conventional EL speech is quite intelligible, its naturalness suffers from the unnatural fundamental frequency (F0) patterns of the mechanically generated excitation signals. To improve the naturalness of EL speech, we have proposed EL speech enhancement methods using statistical F0 pattern prediction. In these methods, the original EL speech recorded by a microphone is presented from a loudspeaker after performing the speech enhancement. These methods are effective in some situations, such as telecommunication, but they are not suitable for face-to-face conversation because not only the enhanced EL speech but also the original EL speech is presented to listeners. In this paper, to develop an EL speech enhancement method that is also effective for face-to-face conversation, we propose a method for directly controlling the F0 patterns of the excitation signals generated from the electrolarynx using the statistical F0 prediction. To get an "actual feel" of the proposed system, we also implement a prototype system. By using the prototype system, we find latency issues caused by real-time processing. To address these latency issues, we furthermore propose segmental continuous F0 pattern modeling and forthcoming F0 pattern modeling. With evaluations through simulation, we demonstrate that our proposed system is capable of effectively addressing the latency issues and the naturalness issues of the electrolarynx.


“…This is also based on the singing-to-singing synthesis approach and is an extension of VocaListener, which deals with only pitch and dynamics. Much previous work has been done on manipulating voice timbre such as speaking voice conversion [12,13], emotional speech synthesis [14][15][16], singing voice conversion [17], and singing voice morphing [18]. However, these approaches cannot deal with intentional temporal timbre changes during singing.…”

Section: VocaListener2: Singing Synthesis System Imitating Voice Timb (mentioning)

confidence: 99%

VocaListener and VocaWatcher: Imitating a human singer by using signal processing

Goto; Nakano; Kajita; et al. 2012. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


In this paper, we describe three singing information processing systems, VocaListener, VocaListener2, and VocaWatcher, that imitate singing expressions of the voice and face of a human singer. VocaListener can synthesize natural singing voices by analyzing and imitating the pitch and dynamics of the human singing. VocaListener2 imitates temporal timbre changes in addition to the pitch and dynamics. In synchronization with the synthesized singing voices, VocaWatcher can generate realistic facial motions of a humanoid robot, the HRP-4C, by analyzing and imitating facial motions of a human singing that are recorded by a single video camera. These systems that focus on "imitation" are not only promising for representing human-like naturalness, but also useful for providing intuitive control means.
