Integration of Spectral Feature Extraction and Modeling for HMM-Based Speech Synthesis

Nakamura, Kazuhiro; Hashimoto, Kei; Nankaku, Yoshihiko; Tokuda, Keiichi

doi:10.1587/transinf.e97.d.1438

Cited by 4 publications

(3 citation statements)

References 21 publications

“…Prior to the recent use of neural networks in TTS, some authors tried to implement the end-to-end concept, sometimes regarded as waveform models [21], [22].…”

Section: A End-to-end Methods (mentioning)

confidence: 99%

Speech Synthesis Based on Deep Neural Networks with Direct Modeling of Amplitude Spectra

Maia, Seara

2018

Anais De XXXVI Simpósio Brasileiro De Telecomunicações E Processamento De Sinais

In recent state-of-the-art text-to-speech systems, a sequence of graphemes is usually mapped directly onto the speech waveform using deep neural networks. Despite reaching very high quality, these approaches tend to be computationally costly at synthesis time, and their training implementation is usually not trivial. In this paper, a method which can be interpreted as a simplified version of these systems is proposed. Here, frame-based smoothed log spectra, fundamental frequency, and phase information are modeled at training time, while synthesis runs in a straightforward fashion. Experiments show that the proposed approach outperforms traditional ones using acoustic modeling of speech features.

“…(14). By back-propagating the derivative of the log likelihood function through the network, the network weights can be updated to maximize the log likelihood.…”

Section: By Assuming (mentioning)

confidence: 99%

“…the log spectral distortion-version of minimum generation error training (MGE-LSD) [11], statistical vocoder (STAVOCO) [12], waveform-level statistical model [13], and mel-cepstral analysis-integrated hidden Markov models (HMMs) [14]. However, there are limitations in these approaches, such as the use of spectra rather than waveforms, the use of overlapping and shifting frames as unit, and fixing decision trees [15], which represent the mapping from linguistic features to acoustic ones [16].…”

Section: Introduction (mentioning)

confidence: 99%

Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis

Tokuda, Zen

2015

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper proposes a novel approach for directly modeling speech at the waveform level using a neural network. This approach uses the neural network-based statistical parametric speech synthesis framework with a specially designed output layer. As acoustic feature extraction is integrated into acoustic model training, it can overcome the limitations of conventional approaches, such as two-step (feature extraction and acoustic modeling) optimization, use of spectra rather than waveforms as targets, use of overlapping and shifting frames as unit, and fixed decision tree structure. Experimental results show that the proposed approach can directly maximize the likelihood defined at the waveform domain.

Index Terms: statistical parametric speech synthesis; neural network; adaptive cepstral analysis.

User Generated Dialogue Systems: uDialogue

Tokuda, Lee, Nankaku, et al.

2017

Human-Harmonized Information Technology, Volume 2
