2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2016
DOI: 10.1109/icassp.2016.7472730

From HMMs to DNNs: Where do the improvements come from?

Abstract: Deep neural networks (DNNs) have recently been the focus of much text-to-speech research as a replacement for decision trees and hidden Markov models (HMMs) in statistical parametric synthesis systems. Performance improvements have been reported; however, the configuration of systems evaluated makes it impossible to judge how much of the improvement is due to the new machine learning methods, and how much is due to other novel aspects of the systems. Specifically, whereas the decision trees in HMM-based system…

Citations: cited by 61 publications (65 citation statements)
References: 15 publications
“…It has been reported that DNN-based techniques have improved the quality of synthetic speech significantly; cf. [10]. A new DNN baseline system [11] added to the Blizzard Challenge 2016 also turned out to be significantly better than the standard HMM-based baseline (that uses a toolkit called HTS [12]), again confirming the speech quality improvements brought on by deep learning approaches.…”
Section: Introduction; citation type: mentioning
confidence: 79%
“…The most popular way to use neural networks in SPSS is with a deep feed-forward neural network (DNN) as a conditional model to map linguistic features to vocoder parameters directly [22], [23], [24], [25], [26]. This can be viewed as replacing the decision tree used in HMM-based speech synthesis with a more powerful regression model [22], [27].…”
Section: A Related Work; citation type: mentioning
confidence: 99%
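The mapping described in the statement above, from linguistic features to vocoder parameters through a feed-forward network, can be illustrated with a minimal sketch. It is not the system from the cited paper: the layer sizes, the random untrained weights, the tanh activation, and the names `init_layer`, `dense`, and `predict_frame` are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer: small random weights, zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation=None):
    """Apply one layer to input vector x, with an optional elementwise activation."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    if activation is not None:
        out = [activation(v) for v in out]
    return out


def predict_frame(linguistic_features, layers):
    """Map one frame's linguistic feature vector to a vocoder parameter vector."""
    h = linguistic_features
    for layer in layers[:-1]:
        h = dense(h, layer, math.tanh)  # hidden layers (tanh is an assumption)
    return dense(h, layers[-1])  # linear output layer, since this is regression


# Hypothetical dimensions; real systems use far larger feature vectors.
N_LINGUISTIC, N_HIDDEN, N_VOCODER = 8, 16, 4

rng = random.Random(0)  # fixed seed so the untrained weights are reproducible
layers = [
    init_layer(N_LINGUISTIC, N_HIDDEN, rng),
    init_layer(N_HIDDEN, N_HIDDEN, rng),
    init_layer(N_HIDDEN, N_VOCODER, rng),
]

# One frame of made-up linguistic features (binary context answers plus a
# numeric position feature).
frame_features = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.25]
vocoder_params = predict_frame(frame_features, layers)
print(len(vocoder_params), "vocoder parameters predicted for this frame")
```

The network is called once per frame, so each frame gets its own prediction. In a trained system the weights would be fitted by regression against vocoder parameters extracted from recorded speech.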
“…Neural network based synthesis has recently produced very high-quality voices, and addresses some of the naturalness issues common to HMM-based voices. [17] found that the across-class averaging resulting from decision tree based context clustering is a major detractor of naturalness in HMM voice quality, and [18] found that replacing the decision trees with DNNs and the production of frame-level rather than state-level predictions substantially improved naturalness as well. Furthermore, [19] found that an HMM system trained on 100 hours of data was comparable in f0 correlation (an objective measure of naturalness) to a DNN system using only 10 hours.…”
Section: Neural Network Synthesis Experiments; citation type: mentioning
confidence: 99%