Interspeech 2016
DOI: 10.21437/interspeech.2016-1460

The IBM 2016 English Conversational Telephone Speech Recognition System

Abstract: We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation testset. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidirectional long short-term memory nets which operate on FMLLR and i-vector features. On the language modeling side, w…

Citations: cited by 105 publications (110 citation statements)
References: 24 publications (47 reference statements)
“…Finally, novel acoustic models, especially the deep models, require long training and experimental turnaround time. While most research groups in industry [53,23,118,42,119,120,121] have the computational resource and large amount of training data,…”
Section: A1 Motivation (mentioning)
confidence: 99%
“…To achieve this, several previous studies train a speaker independent DNN using many speech samples spoken by many speakers [3][4][5][6][7][8][9][10][11][12][13][14]. Meanwhile, in other speech applications, model specialization to the target speaker has succeeded [15,16]. In text-to-speech synthesis (TTS), the target speaker model is trained using samples spoken by a target speaker, and that has achieved high performance [15].…”
Section: Introduction (mentioning)
confidence: 99%
“…Deep learning has significantly advanced the state-of-the-art in speech recognition over the past few years [1]-[3]. Most speech recognisers now employ the neural network and hidden Markov model (NN/HMM) hybrid architecture, first investigated in the early 1990s [4], [5].…”
Section: Introduction (mentioning)
confidence: 99%