2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2011.5947401
Large vocabulary continuous speech recognition with context-dependent DBN-HMMS

Abstract: The context-independent deep belief network (DBN) hidden Markov model (HMM) hybrid architecture has recently achieved promising results for phone recognition. In this work, we propose a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large vocabulary, spontaneous speech recognition dataset from the Bing mobile voice search task. Our system achieves absolute sentence accuracy improvements of 5.8% and 9.2% over GMM-HMMs trained us…
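The hybrid architecture the abstract describes replaces the GMM observation likelihood with a scaled DNN/DBN posterior: during decoding, p(x|s) is taken proportional to p(s|x) / p(s), where p(s) is the state prior estimated from the training alignment. A minimal sketch of that conversion (toy numbers, all values illustrative):

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert frame-level state posteriors p(s|x) into scaled
    log-likelihoods log p(x|s) + const = log p(s|x) - log p(s)."""
    return log_posteriors - log_priors

# Toy example: 2 frames, 3 HMM states.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])
priors = np.array([0.5, 0.3, 0.2])  # state priors from the alignment

loglik = scaled_log_likelihoods(np.log(posteriors), np.log(priors))
```

The resulting scaled log-likelihoods are plugged into standard Viterbi decoding in place of GMM scores; the constant p(x) cancels across competing paths.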

Cited by 138 publications (73 citation statements)
References 14 publications (14 reference statements)
“…In recent years, deep learning models have been used for phonetic classification and recognition on a variety of speech tasks and showed promising results [7,8]. A Deep Boltzmann Machine is a network of symmetrically coupled stochastic binary units [6,9].…”
Section: Deep Boltzmann Machines
confidence: 99%
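The "symmetrically coupled stochastic binary units" in the statement above refer to the energy-based formulation shared by RBMs and Deep Boltzmann Machines: visible and hidden binary units interact through a symmetric weight matrix W, and states are sampled by Gibbs steps. A minimal single-layer (RBM) sketch, with all dimensions and values illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W, b, c):
    """E(v, h) = -b.v - c.h - v^T W h; the coupling W is symmetric in the
    sense that the same matrix governs both conditional distributions."""
    return -(b @ v) - (c @ h) - v @ W @ h

def gibbs_step(v, W, b, c, rng):
    """Sample hidden units given visibles, then visibles given hiddens."""
    ph = sigmoid(c + v @ W)                       # p(h=1 | v)
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(b + W @ h)                       # p(v=1 | h)
    v_new = (rng.random(pv.shape) < pv).astype(float)
    return v_new, h

# Toy configuration: 4 visible, 3 hidden binary units.
W = 0.1 * rng.standard_normal((4, 3))
b = np.zeros(4)
c = np.zeros(3)
v = rng.integers(0, 2, 4).astype(float)
h = rng.integers(0, 2, 3).astype(float)
E = energy(v, h, W, b, c)
v_next, h_next = gibbs_step(v, W, b, c, rng)
```

A full DBM adds further hidden layers whose conditionals depend on the layers both above and below; the sketch shows only the two-layer case.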
“…The resulting deep belief nets learn a hierarchy of nonlinear feature detectors that can capture complex statistical patterns in data. The deep belief net training algorithm suggested in [24] first initializes the weights of each layer individually in a purely unsupervised way and then fine-tunes the entire network using labeled data. This semi-supervised approach using deep models has proved effective in a number of applications, including coding and classification for speech, audio, text, and image data ([25]–[29]).…”
Section: Introduction
confidence: 99%
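The greedy layer-wise procedure referenced in the statement above trains one RBM at a time, each on the hidden activations of the layer below, before supervised fine-tuning of the whole stack. A compact sketch of the unsupervised stage using one-step contrastive divergence (CD-1); hyperparameters and sizes are illustrative, and the supervised fine-tuning pass is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Train a Bernoulli RBM with one-step contrastive divergence."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_visible)
    c = np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        ph0 = sigmoid(c + v0 @ W)                       # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(b + h0 @ W.T)                     # reconstruction
        ph1 = sigmoid(c + pv1 @ W)                      # negative phase
        W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(data)
        b += lr * (v0 - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

def greedy_pretrain(data, layer_sizes):
    """Train each RBM on the hidden activations of the one below it."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, b, c = cd1_train_rbm(x, n_hidden)
        weights.append((W, c))
        x = sigmoid(c + x @ W)  # propagate data up to the next layer
    return weights

# Toy run: 20 binary examples of dimension 8, two stacked layers.
data = (rng.random((20, 8)) < 0.5).astype(float)
stack = greedy_pretrain(data, [6, 4])
```

After this unsupervised initialization, the stacked weights seed a feed-forward network that is fine-tuned with backpropagation on the labeled data, which is the semi-supervised recipe the quoted passage describes.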
“…The revolution took place in 2010 after the close collaboration between academic and industrial research groups, including the University of Toronto, Microsoft, and IBM [1,4,5]. This research found that very significant performance improvements can be accomplished with the NN-based hybrid approach, with a few novel techniques and design choices: (1) extending NNs to DNNs, i.e., involving a large number of hidden layers (usually 4 to 8); (2) employing appropriate initialization methods, e.g., pre-training with restricted Boltzmann machines (RBMs); and (3) using fine-grained NN targets, e.g., context-dependent states.…”
Section: Introduction
confidence: 99%
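Design choices (1) and (3) in the statement above amount to a feed-forward network with several hidden layers whose softmax output layer targets context-dependent (tied triphone) states rather than monophones. A minimal forward-pass sketch; the layer sizes, senone count, and random weights are all illustrative stand-ins for a trained model:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dnn_posteriors(x, hidden_layers, out_W, out_b):
    """Forward pass through sigmoid hidden layers, then a softmax over
    context-dependent state targets."""
    h = x
    for W, b in hidden_layers:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    return softmax(h @ out_W + out_b)

rng = np.random.default_rng(1)
dims = [39, 256, 256, 256, 256]  # 39-dim features, 4 hidden layers (illustrative)
hidden = [(0.1 * rng.standard_normal((dims[i], dims[i + 1])), np.zeros(dims[i + 1]))
          for i in range(len(dims) - 1)]
n_senones = 1000  # hypothetical number of tied context-dependent states
out_W = 0.1 * rng.standard_normal((dims[-1], n_senones))
out_b = np.zeros(n_senones)

post = dnn_posteriors(rng.standard_normal((5, 39)), hidden, out_W, out_b)
```

In the hybrid recipe, these per-frame senone posteriors are divided by the senone priors to obtain scaled likelihoods for HMM decoding, and the weights would be initialized by the RBM pre-training of design choice (2).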