2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2016.7472618
End-to-end attention-based large vocabulary speech recognition

Abstract: Many of the current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Most of these systems contain separate components that deal with acoustic modelling, language modelling and sequence decoding. We investigate a more direct approach in which the HMM is replaced with a Recurrent Neural Network (RNN) that performs sequence prediction directly at the character level. Alignment between the input features and the des…

Cited by 976 publications (757 citation statements)
References 23 publications (39 reference statements)
“…The word "ROCK" is corrected to "DRAW" after hearing "RATE" and "IN DRAW RATE" to "AND DRAW CROWD" while hearing "PEOPLE".

    [9]                   CTC + Trigram (extended)    7.34%
    Miao et al [9]        CTC + Trigram               9.07%
    Hannun et al [8]      CTC + Bigram                14.1%
    Bahdanau et al [10]   Encoder-decoder + Trigram   11.3%
    Woodland et al [21]   GMM-HMM + Trigram           9.46%
    Miao et al [9]        DNN-HMM + Trigram           7.14%

…is roughly 0.5% to 1% WER. However, there was little difference when the beam width increased from 512 to 2048 in our preliminary experiments.…”
Section: Methods
confidence: 99%
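The beam-width comparison in the excerpt above can be illustrated with a generic beam search. This is a minimal sketch, not the cited systems' decoder: the `step_log_probs` structure and the toy two-character distributions below are illustrative assumptions standing in for the CTC or encoder-decoder scores (combined with n-gram LM scores) that the excerpt compares.

```python
import math

def beam_search(step_log_probs, beam_width):
    """Keep the beam_width best hypotheses at each decoding step.

    step_log_probs: a list of dicts mapping token -> log-probability,
    one dict per time step (a toy stand-in for acoustic + LM scores).
    """
    beams = [("", 0.0)]  # (hypothesis, cumulative log-probability)
    for dist in step_log_probs:
        candidates = [
            (hyp + tok, score + lp)
            for hyp, score in beams
            for tok, lp in dist.items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the beam width
    return beams[0][0]

# Toy example: two steps over a two-character vocabulary.
steps = [
    {"A": math.log(0.6), "B": math.log(0.4)},
    {"A": math.log(0.1), "B": math.log(0.9)},
]
best = beam_search(steps, beam_width=2)  # "AB"
```

A wider beam keeps more partial hypotheses alive per step at higher cost; the excerpt's observation is that beyond some width (512 here) the extra hypotheses rarely change the final output.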
“…Also, a sub-lexical language model is proposed in [5] for detecting previously unseen words. RNN-based character-level end-to-end ASR systems were studied in [6,7,8,9,10]. However, they lack the capability of dictating OOV words since the decoding is performed with word-level LMs.…”
Section: Introduction
confidence: 99%
“…In this sense, this task is similar to aspect-based sentiment analysis (Pontiki et al, 2016), where the task is not to classify a text or sentence, but an entity within the text. The notion of focus is similar to attention (Bahdanau et al, 2016;Yin et al, 2016), with the difference that attention is learned during training whereas focus is given as an additional input.…”
Section: Approach
confidence: 99%
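The contrast the excerpt draws between learned attention and externally given focus comes down to where the alignment scores originate. A minimal sketch (an assumption for illustration, not the cited models' architecture) of turning scores into a weighted context value:

```python
import math

def attention_weights(scores):
    """Numerically stable softmax over alignment scores.

    With learned attention (Bahdanau-style), a trained network produces
    these scores from the data; a given 'focus' would instead be supplied
    directly as an additional input rather than learned.
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(values, scores):
    """Weighted sum of values under the softmax of the scores."""
    return sum(w * v for w, v in zip(attention_weights(scores), values))

# Equal scores -> uniform weights -> plain average of the values.
ctx = attend([1.0, 3.0], [0.0, 0.0])  # 2.0
```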
“…N-grams of length 1, 2 and 3 are called 'unigrams', 'bigrams' and 'trigrams' respectively. N-grams are widely used for speech pattern recognition [14,15] and for identifying a particular language; text classification also relies on N-grams for effective results [16].…”
Section: Generate N-grams
confidence: 99%
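The n-gram extraction described above can be sketched in a few lines; the `ngrams` helper below is an illustrative name, not a function from the cited work.

```python
def ngrams(tokens, n):
    """All contiguous n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the cat sat on the mat".split()
unigrams = ngrams(words, 1)  # [('the',), ('cat',), ('sat',), ...]
bigrams = ngrams(words, 2)   # [('the', 'cat'), ('cat', 'sat'), ...]
trigrams = ngrams(words, 3)  # [('the', 'cat', 'sat'), ...]
```

The same helper works at the character level, which is how n-gram features are often built for language identification.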