2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
DOI: 10.1109/asru.2015.7404790
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding

Abstract: The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs). Despite this progress, building a new ASR system remains a challenging task, requiring various resources, multiple training stages and significant expertise. This paper presents our Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems. Acoustic modeling in Eesen involves learning a single recurrent neural network (RNN) predict…
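The citing papers below describe Eesen's acoustic model as a CTC-trained RNN. A defining piece of CTC is its output collapse rule: merge consecutive repeated symbols, then delete blanks. The sketch below (hypothetical symbols, not Eesen code) illustrates only that rule, under the assumption of a greedy per-frame best path.

```python
# Minimal sketch of CTC's collapse rule on a greedy per-frame output:
# merge consecutive repeats, then drop the blank symbol.

BLANK = "-"  # hypothetical blank marker

def ctc_collapse(frames):
    """Collapse a per-frame CTC output sequence into a label sequence."""
    out = []
    prev = None
    for sym in frames:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank; blanks separate genuine repeats.
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)
```

For example, the frame sequence `h h - e - l l - l o` collapses to `hello`: the blank between the two `l` runs is what allows a doubled letter to survive.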


Cited by 608 publications (539 citation statements)
References 35 publications
“…The other systems perform sentence-wise offline decoding with bidirectional RNNs. The best result was achieved by Miao et al [9] with a CTC-trained deep bidirectional LSTM network and a retrained trigram LM with extended vocabulary. The systems with the original trigram model provided with the WSJ corpus perform worse than our ISR system with character-level RNN LM.…”
Section: Methods
Citation type: mentioning (confidence: 92%)
“…The word "ROCK" is corrected to "DRAW" after hearing "RATE" and "IN DRAW RATE" to "AND DRAW CROWD" while hearing "PEOPLE".

  System               Model + LM                  WER
  Miao et al [9]       CTC + Trigram (extended)    7.34%
  Miao et al [9]       CTC + Trigram               9.07%
  Hannun et al [8]     CTC + Bigram                14.1%
  Bahdanau et al [10]  Encoder-decoder + Trigram   11.3%
  Woodland et al [21]  GMM-HMM + Trigram           9.46%
  Miao et al [9]       DNN-HMM + Trigram           7.14%

… is roughly 0.5% to 1% WER. However, there was little difference when the beam width increases from 512 to 2048 in our preliminary experiments.…”
Section: Methods
Citation type: mentioning (confidence: 99%)
“…Another important feature is that the modeling units are directly phones or characters, which simplifies the ASR system. A decoder that integrates the word-level language model via weighted finite-state transducers (WFSTs) is introduced in [5], which makes the performance of the new framework comparable with that of the traditional HMM/DNN system and speeds up decoding significantly thanks to context-independent (CI) phone modeling. In [6], the role of the blank symbol is discussed extensively for a similar challenge in handwriting recognition.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
“…In this paper, we investigate adaptive per-dimension learning-rate methods, including ADAGRAD and ADADELTA [8], to improve the performance of ASR under the end-to-end framework. Following the work in [5], deep bidirectional long short-term memory (LSTM) recurrent neural networks (RNNs) are used for acoustic modeling. The RNN is trained to predict CI phone labels given sequences of speech features.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)