A trainable excitation model for HMM-based speech synthesis

Maia, Ranniery; Toda, Tomoki; Zen, Heiga; Nankaku, Yoshihiko; Tokuda, Keiichi

doi:10.21437/interspeech.2007-530

Cited by 26 publications

(32 citation statements)

References 10 publications

“…Some efforts have been devoted in speech synthesis in order to enhance the quality and naturalness by adopting a more subtle excitation model. In the Codebook Excited Linear Predictive (CELP) approach [4], the residual signal is constructed from a codebook containing several typical excitation frames [5]. The Multi Band Excitation (MBE) modeling [6] suggests to divide the frequency axis in several bands, and a voiced/unvoiced decision is taken for each band at each time.…”

Section: Introduction (mentioning)

confidence: 99%

The Deterministic Plus Stochastic Model of the Residual Signal and Its Applications

Drugman

Dutoit

2012

IEEE Trans. Audio Speech Lang. Process.

112

Speech generated by parametric synthesizers generally suffers from a typical buzziness, similar to what was encountered in old LPC-like vocoders. To alleviate this problem, a better-suited model of the excitation should be adopted. For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two distinct spectral bands delimited by the maximum voiced frequency. The deterministic part concerns the low-frequency contents and consists of a decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. The stochastic component is a high-pass filtered noise whose time structure is modulated by an energy envelope, similarly to what is done in the Harmonic plus Noise Model (HNM). The proposed residual model is integrated within an HMM-based speech synthesizer and is compared to the traditional excitation through a subjective test. Results show a significant improvement for both male and female voices. In addition, the proposed model requires little computational load and memory, which is essential for its integration in commercial applications.
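The two-band split described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `dsm_frame`, the brick-wall FFT masking, the triangular (Bartlett) energy envelope, the 4 kHz maximum voiced frequency, and the noise gain are all assumptions chosen for brevity.

```python
import numpy as np

def dsm_frame(deterministic, fs=16000, fm=4000.0, noise_gain=0.1, seed=0):
    """Build one excitation frame as low-band deterministic + high-band noise.

    deterministic: a pitch-synchronous waveform (hypothetical input; in the
    abstract it would come from the PCA-based decomposition).
    fm: assumed maximum voiced frequency in Hz, the boundary between bands.
    """
    n = len(deterministic)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)

    # Deterministic part: keep only content at or below fm.
    det_spec = np.fft.rfft(deterministic)
    det_spec[freqs > fm] = 0.0
    low = np.fft.irfft(det_spec, n=n)

    # Stochastic part: white noise with everything at or below fm removed.
    rng = np.random.default_rng(seed)
    noise_spec = np.fft.rfft(rng.standard_normal(n))
    noise_spec[freqs <= fm] = 0.0
    high = np.fft.irfft(noise_spec, n=n)

    # Modulate the noise in time with an energy envelope (assumed triangular).
    high = noise_gain * np.bartlett(n) * high

    return low + high, low, high
```

Calling `dsm_frame(np.hanning(256))` returns the mixed frame together with its two components, which makes it easy to inspect each band separately.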

“…According to the Mixed Excitation (ME) approach [7], the residual signal is the superposition of both a periodic and a non-periodic component. Various models derived from the ME approach have been used in HMM-based speech synthesis [8], [9], [10]. A popular technique used in parametric synthesis is the STRAIGHT vocoder [11].…”

Section: Introduction (mentioning)

confidence: 99%

The Deterministic Plus Stochastic Model of the Residual Signal and Its Applications

Drugman

Dutoit

2012

IEEE Trans. Audio Speech Lang. Process.

112

“…• Although we proposed the use of a Principal Component Analysis, other data mining methods (possibly derived from the functional PCA literature, [21]) could be efficiently employed to extract a suitable representation from the large dataset of normalized GCI-centered residual frames (obtained as described in Section 2.1). • Finally, it would certainly be very interesting to compare the proposed approach with other techniques of excitation modeling, such as STRAIGHT [22], the mixed excitation [7], [8], or based on the Liljencrants-Fant model [9]. Although all these approaches reported a relative improvement with regard to the traditional pulse excitation, no comparison is available yet, since authors worked with different synthesis frameworks and with different databases.…”

Section: Discussion (mentioning)

confidence: 99%

“…In [7], the filter coefficients were derived from bandpass voicing strengths. In [8], state-dependent high-degree filters were directly trained using a closed loop procedure. The integration of a Liljencrants-Fant waveform as a modeling of the glottal source, possibly producing different voice qualities by varying the LF parameters, was proposed in [9].…”

Section: Introduction (mentioning)

confidence: 99%

Eigenresiduals for improved Parametric Speech Synthesis

Drugman

Wilfart

Dutoit

2020

Preprint

Statistical parametric speech synthesizers have recently shown their ability to produce natural-sounding and flexible voices. Unfortunately the delivered quality suffers from a typical buzziness due to the fact that speech is vocoded. This paper proposes a new excitation model in order to reduce this undesirable effect. This model is based on the decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. This basis contains a limited number of eigenresiduals and is computed on a relatively small speech database. A stream of PCA-based coefficients is added to our HMM-based synthesizer and allows the voiced excitation to be generated during synthesis. An improvement compared to the traditional excitation is reported while the synthesis engine footprint remains under about 1Mb.
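The decomposition of residual frames on an orthonormal PCA basis can be sketched as follows. This is an illustrative example only: the helper names `eigenresiduals` and `reconstruct` are hypothetical, the PCA is computed with a plain SVD, and the frames are assumed to be already pitch-synchronous, length-normalized, and stacked one per row.

```python
import numpy as np

def eigenresiduals(frames, n_components):
    """PCA over residual frames (one frame per row).

    Returns the mean frame, an orthonormal basis with n_components rows,
    and the per-frame coefficients on that basis.
    """
    mean = frames.mean(axis=0)
    centered = frames - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    coeffs = centered @ basis.T
    return mean, basis, coeffs

def reconstruct(mean, basis, coeffs):
    """Rebuild frames from their PCA coefficients."""
    return mean + coeffs @ basis
```

In a synthesizer along the lines of the abstract, only the small basis and a stream of coefficients would need to be stored or predicted, which is consistent with the low footprint it reports; how the coefficients are modeled by the HMM is not shown here.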

“…We hypothesize that a better NSF source signal for voiced sounds may contain a certain degree of randomness in the short term while preserving long-term periodicity. Although source signals for classical speech vocoders [24,25,26,27] may be used, we focus on source signals that have a simple parametric form and require no additional analysis loop.…”

Section: Cyclic Noise-based Source Signal (mentioning)

confidence: 99%

Using Cyclic Noise as the Source Signal for Neural Source-Filter-based Speech Waveform Model

Wang

Yamagishi

2020

Preprint

Neural source-filter (NSF) waveform models generate speech waveforms by morphing sine-based source signals through dilated convolution in the time domain. Although the sine-based source signals help the NSF models to produce voiced sounds with specified pitch, the sine shape may constrain the generated waveform when the target voiced sounds are less periodic. In this paper, we propose a more flexible source signal called cyclic noise, a quasi-periodic noise sequence given by the convolution of a pulse train and a static random noise with a trainable decaying rate that controls the signal shape. We further propose a masked spectral loss to guide the NSF models to produce periodic voiced sounds from the cyclic noise-based source signal. Results from a large-scale listening test demonstrated the effectiveness of the cyclic noise and the masked spectral loss on speaker-independent NSF models in copy-synthesis experiments on the CMU ARCTIC database.
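The cyclic noise construction described in this abstract, a pulse train convolved with a decaying static noise, can be sketched in a few lines. The sketch makes several assumptions that the abstract does not state: a constant F0, an exponential decay shape, a fixed `decay` value in place of the trainable rate, and the function name `cyclic_noise` itself.

```python
import numpy as np

def cyclic_noise(f0_hz, fs=16000, n_samples=800, kernel_len=64,
                 decay=0.1, seed=0):
    """Quasi-periodic noise: pulse train convolved with decaying noise.

    decay: fixed here for illustration; the abstract describes it as a
    trainable parameter controlling the signal shape.
    """
    # One unit pulse per pitch period (constant F0 assumed).
    period = int(round(fs / f0_hz))
    pulses = np.zeros(n_samples)
    pulses[::period] = 1.0

    # "Static" noise: drawn once and reused at every pulse.
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(kernel_len)
    kernel = noise * np.exp(-decay * np.arange(kernel_len))

    # Convolution places a copy of the decaying noise at each pulse.
    return np.convolve(pulses, kernel)[:n_samples]
```

With the defaults and `f0_hz=100.0`, the kernel is shorter than the pitch period, so every cycle of the output is an identical copy of the decaying noise burst. A smaller `decay` value lets the noise last longer within each cycle.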
