Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181)
DOI: 10.1109/icassp.1998.675495

A high quality text-to-speech system composed of multiple neural networks

Abstract: While neural networks have been employed to handle several different text-to-speech tasks, ours is the first system to use neural networks throughout, for both linguistic and acoustic processing. We divide the text-to-speech task into three subtasks, a linguistic module mapping from text to a linguistic representation, an acoustic module mapping from the linguistic representation to speech, and a video module mapping from the linguistic representation to animated images. The linguistic module employs a letter-…
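The three-module decomposition described in the abstract can be sketched as a minimal pipeline. This is a hypothetical illustration only: the function names, signatures, and stand-in bodies below are not from the paper, whose modules are neural networks rather than the placeholder rules shown here.

```python
# Hypothetical sketch of the paper's three-module TTS decomposition:
# linguistic (text -> linguistic representation), acoustic (-> speech
# parameters), and video (-> animated-image frames). All bodies are
# illustrative stand-ins, not the paper's neural networks.

def linguistic_module(text: str) -> list[str]:
    """Map raw text to a linguistic representation (here: crude symbols)."""
    return [ch for ch in text.lower() if ch.isalpha()]

def acoustic_module(symbols: list[str]) -> list[float]:
    """Map the linguistic representation to acoustic parameters."""
    # Stand-in: one dummy parameter value per symbol.
    return [float(ord(s)) for s in symbols]

def video_module(symbols: list[str]) -> list[str]:
    """Map the linguistic representation to animated-image (viseme) frames."""
    return [f"viseme:{s}" for s in symbols]

def text_to_speech(text: str) -> tuple[list[float], list[str]]:
    """Chain the three modules, sharing one linguistic representation."""
    symbols = linguistic_module(text)
    return acoustic_module(symbols), video_module(symbols)
```

The point of the sketch is the shared intermediate representation: both the acoustic and video modules consume the linguistic module's output, so the two output streams stay synchronized.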


Cited by 14 publications (4 citation statements)
References 5 publications
“…More recently, following on from successes in automatic speech recognition [9], artificial neural networks have reemerged as acoustic models for SPSS [10]. By the 1990s, artificial neural networks had already been employed as feature extractors from text input to produce linguistic features [11], as acoustic models to map linguistic features to vocoder parameters [12], [13], [14], and to predict segment durations [15]. One prominent theme in more recent studies is the use of neural architectures to replace Gaussian mixture models (GMMs) associated with leaf nodes of decision trees, such as the restricted Boltzmann machines (RBMs) in [16], where RBMs were claimed to better learn spectral detail, resulting in better quality synthesised speech.…”
Section: A. Related Work
confidence: 99%
“…A typical statistical parametric TTS pipeline has the following stages: grapheme-to-phoneme conversion, a phoneme duration predictor, an acoustic frame-level feature generator, and a vocoder [11]. Neural networks were used for TTS as early as the 1990s [12,13,14,15]. Deep learning re-introduced NNs to TTS: Zen et al [16,17] proposed a hybrid NN-parametric TTS model, where deep NNs are used to predict the phoneme duration and to generate frame-level acoustic features.…”
Section: Related Work
confidence: 99%
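The four-stage statistical parametric pipeline listed in the excerpt above can be sketched as chained functions. The stage bodies below are placeholders for illustration, not the cited systems, and the fixed per-phoneme duration is an assumption made only to keep the sketch runnable.

```python
# Hypothetical sketch of the statistical parametric TTS stages named in the
# excerpt: grapheme-to-phoneme conversion, duration prediction, frame-level
# acoustic feature generation, and a vocoder. All stage bodies are stand-ins.

def grapheme_to_phoneme(text: str) -> list[str]:
    """Stage 1: map graphemes to (here: crude character-level) phonemes."""
    return [ch for ch in text.lower() if ch.isalpha()]

def predict_durations(phonemes: list[str]) -> list[int]:
    """Stage 2: predict a frame count per phoneme (fixed 5 frames here)."""
    return [5 for _ in phonemes]

def generate_acoustic_frames(phonemes: list[str],
                             durations: list[int]) -> list[str]:
    """Stage 3: expand each phoneme into one symbolic frame per tick."""
    return [p for p, d in zip(phonemes, durations) for _ in range(d)]

def vocoder(frames: list[str]) -> list[float]:
    """Stage 4: turn frame-level features into samples (one per frame here)."""
    return [float(ord(f)) for f in frames]

def synthesize(text: str) -> list[float]:
    """Run the four stages in order, as in the pipeline the excerpt lists."""
    phonemes = grapheme_to_phoneme(text)
    durations = predict_durations(phonemes)
    frames = generate_acoustic_frames(phonemes, durations)
    return vocoder(frames)
```

In the neural systems the excerpt discusses, each placeholder stage would be replaced by a trained model, but the stage boundaries and data flow stay the same.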
“…As a result, we used a neural network to extract allophones. Because of their learning power, neural networks can learn from a database and recognize allophones properly [35]. Fig.…
Section: Allophone-Based TTS System
confidence: 99%