Interspeech 2015
DOI: 10.21437/interspeech.2015-632
The IBM 2015 English conversational telephone speech recognition system

Cited by 44 publications (33 citation statements)
References 23 publications
Citation types: 2 supporting, 31 mentioning, 0 contrasting
“…We remind the reader that maxout nets [12] generalize ReLU units by employing non-linearities of the form $s_i = \max_{j \in C(i)} w_j^T x + b_j$, where the subsets of neurons $C(i)$ are typically disjoint. In [11] we showed that maxout DNNs and CNNs trained with annealed dropout outperform their sigmoid-based counterparts in both the 300-hour and 2000-hour training regimes. What was missing there was a comparison between maxout and sigmoid for unfolded RNNs [4].…”
Section: Recurrent Nets With Maxout Activations (mentioning)
confidence: 95%
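To make the maxout non-linearity concrete, here is a minimal NumPy sketch of the formula quoted above. The function name, shapes, and the choice of equal-sized consecutive groups are illustrative assumptions; the only fixed ingredient is that each output takes the max over a disjoint set of linear pieces.

```python
import numpy as np

def maxout(x, W, b, group_size):
    """Maxout activation: s_i = max_{j in C(i)} (w_j^T x + b_j),
    where the groups C(i) partition the linear pieces (disjoint subsets).
    Here the partition is assumed to be consecutive blocks of `group_size`."""
    z = W @ x + b                      # all linear pieces, shape (num_pieces,)
    z = z.reshape(-1, group_size)      # one row per output neuron i, i.e. per group C(i)
    return z.max(axis=1)               # keep the largest piece in each group

# Toy usage: 8 linear pieces grouped into 4 maxout neurons (group_size=2).
rng = np.random.default_rng(0)
x = rng.standard_normal(10)            # input vector
W = rng.standard_normal((8, 10))       # weights w_j of each linear piece
b = rng.standard_normal(8)             # biases b_j
print(maxout(x, W, b, group_size=2))   # 4 maxout outputs
```

With group_size=1 this reduces to a plain linear layer; ReLU is recovered as the max of a learned piece and a fixed zero piece, which is the sense in which maxout generalizes ReLU.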
“…The decodings are done with a small vocabulary of 30K words and a small 4-gram language model with 4M n-grams. Note that the sigmoid RNNs have better error rates than those reported in [11] because they were retrained after the data was realigned with the best joint RNN/CNN model. We observe that the maxout RNNs are consistently better and that, by themselves, they achieve a WER similar to that of our previous best model, the joint RNN/CNN with sigmoid activations.…”
Section: Recurrent Nets With Maxout Activations (mentioning)
confidence: 96%
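For readers unfamiliar with n-gram decoding LMs, here is a minimal Python sketch of how a 4-gram language model assigns word probabilities from counts. All names are hypothetical, and a production LM like the one cited would use smoothing (e.g. Kneser-Ney) and backoff rather than raw maximum-likelihood estimates.

```python
from collections import defaultdict

def train_4gram(sentences):
    """Collect maximum-likelihood 4-gram counts (no smoothing or backoff)."""
    num = defaultdict(int)   # counts of (w1, w2, w3, w4)
    den = defaultdict(int)   # counts of the (w1, w2, w3) history
    for sent in sentences:
        toks = ["<s>"] * 3 + sent.split() + ["</s>"]
        for i in range(3, len(toks)):
            hist = tuple(toks[i - 3:i])
            num[hist + (toks[i],)] += 1
            den[hist] += 1
    return num, den

def prob(num, den, history, word):
    """P(word | last three words of history), zero if the history is unseen."""
    hist = tuple(history[-3:])
    return num[hist + (word,)] / den[hist] if den[hist] else 0.0

num, den = train_4gram(["the cat sat", "the cat ran"])
print(prob(num, den, ["<s>", "<s>", "the"], "cat"))  # 1.0
```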
“…The training is accomplished using the IBM Attila toolkit [24] on 600 hours of conversational telephone speech (CTS) data from the Fisher corpus [25], with a 9-frame context of 40-dimensional speaker-adapted feature vectors obtained using per-recording fMLLR transforms [16,17]. The fMLLR transforms are generated for each recording with decoding alignments obtained from a GMM-HMM acoustic model (see [26,27] for more details).…”
Section: DNN System Configuration (mentioning)
confidence: 99%
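As a rough illustration of the input pipeline described above, the following NumPy sketch stacks each 40-dimensional frame with its four neighbours on either side to form the 9-frame (360-dimensional) DNN input. The function name and the edge-padding choice (repeating boundary frames) are assumptions, not details taken from the paper.

```python
import numpy as np

def splice_frames(feats, context=4):
    """Stack each frame with +/- `context` neighbours: for 40-dim fMLLR
    features and context=4, this yields the 9-frame, 360-dim DNN input.
    Utterance edges are handled by repeating the first/last frame."""
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)])
    # For frame t, concatenate padded frames t .. t + 2*context,
    # which are the original frames t-context .. t+context.
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])

utt = np.random.randn(100, 40)   # 100 frames of 40-dim fMLLR features
inputs = splice_frames(utt)
print(inputs.shape)              # (100, 360)
```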