2018 · Preprint
DOI: 10.48550/arxiv.1801.00059

The CAPIO 2017 Conversational Speech Recognition System

Kyu J. Han, Akshay Chandrashekaran, Jungsuk Kim, et al.

Abstract: In this paper we show how we have achieved state-of-the-art performance on the industry-standard NIST 2000 Hub5 English evaluation set. We propose densely connected LSTMs (dense LSTMs), inspired by the densely connected convolutional networks recently introduced for image classification tasks. We show that the proposed dense LSTMs provide more reliable performance than conventional residual LSTMs as more LSTM layers are stacked in the network. We also propose an acous…
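To make the dense-connectivity idea in the abstract concrete, below is a minimal sketch of DenseNet-style skip connections applied to a stack of LSTM layers: each layer consumes the concatenation of the raw input and every earlier layer's output, rather than only the previous layer's output (plain stacking) or a summed shortcut (residual LSTM). The class name, layer sizes, and concatenation scheme are illustrative assumptions, not the CAPIO system's actual configuration.

```python
# Hedged sketch of a densely connected LSTM stack (PyTorch).
# Not the authors' implementation; dimensions are placeholders.
import torch
import torch.nn as nn


class DenseLSTMStack(nn.Module):
    """Layer k receives [input, out_1, ..., out_{k-1}] concatenated along
    the feature dimension, in the spirit of densely connected networks."""

    def __init__(self, input_size: int, hidden_size: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        in_size = input_size
        for _ in range(num_layers):
            self.layers.append(nn.LSTM(in_size, hidden_size, batch_first=True))
            # The next layer sees everything produced so far plus the raw input.
            in_size += hidden_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_size) frame-level acoustic features
        features = [x]
        for lstm in self.layers:
            out, _ = lstm(torch.cat(features, dim=-1))
            features.append(out)
        # Return the last layer's output; in a hybrid acoustic model a
        # projection and softmax over senones would follow.
        return features[-1]


if __name__ == "__main__":
    model = DenseLSTMStack(input_size=40, hidden_size=64, num_layers=4)
    dummy = torch.randn(2, 100, 40)  # 2 utterances, 100 frames, 40-dim features
    print(model(dummy).shape)        # torch.Size([2, 100, 64])
```

Because every layer can read earlier layers' outputs directly, gradients reach the lower layers through short paths, which is the property the abstract credits for more reliable behavior as depth grows.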

Cited by 9 publications (14 citation statements). References 27 publications (54 reference statements).
“…Table 1 compares the TDS model with three other systems. The CAPIO system is a hybrid HMM-DNN with speaker adaptation [33]. The other two are end-to-end models, one using the CRF-style ASG loss [31] and the other a sequence-to-sequence model with an RNN encoder [23].…”
Section: Results (mentioning)
Confidence: 99%
“…The error rates on the SWB and CH subsets decrease from 6.5 and 11.9 to 6.2 and 11.4 (Table 2). Our best model is significantly better than previously published CTC [29] and LSTM-based [3] models, and approaches the heavily tuned hybrid system [28] with dense TDNN-LSTM. It is likely possible to reach better error rates with the help of ensembled models, further data augmentation, and language models.…”
Section: Speech Recognition Results (mentioning)
Confidence: 69%
“…be found in telephony speech or readings of audio books. On standard tasks for this scenario, such as Switchboard and LibriSpeech [1,2], typical WERs are below 10%. Nevertheless, ASR on noisy data remains challenging.…”
Section: Introduction (mentioning)
Confidence: 96%