We present state-of-the-art automatic speech recognition (ASR) systems employing a standard hybrid DNN/HMM architecture compared to an attention-based encoder-decoder design for the LibriSpeech task. Detailed descriptions of the system development, including model design, pretraining schemes, training schedules, and optimization approaches, are provided for both system architectures. Both the hybrid DNN/HMM and the attention-based systems employ bi-directional LSTMs for acoustic modeling/encoding. For language modeling, we employ both LSTM- and Transformer-based architectures. All our systems are built using RWTH's open-source toolkits RASR and RETURNN. To the best of the authors' knowledge, the results obtained when training on the full LibriSpeech training set are the best published to date, both for the hybrid DNN/HMM and the attention-based systems. Our single hybrid system even outperforms previous results obtained from combining eight single systems. Our comparison shows that on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the attention-based system by 15% relative on the clean and 40% relative on the other test sets in terms of word error rate. Moreover, experiments on a reduced 100h subset of the LibriSpeech training corpus show an even more pronounced margin between the hybrid DNN/HMM and attention-based architectures.
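For concreteness, below is a minimal PyTorch sketch of the bi-directional LSTM stack that both architectures share as acoustic model/encoder. The layer count, dimensions, and output size are illustrative assumptions, not the paper's configuration (which is built in RASR/RETURNN):

```python
import torch
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    """Sketch of a bi-directional LSTM acoustic encoder of the kind
    used both as the hybrid acoustic model and as the attention
    encoder. Sizes are placeholder values, not the paper's."""
    def __init__(self, feat_dim=40, hidden_dim=512, num_layers=6,
                 num_outputs=12001):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden_dim,
                             num_layers=num_layers, bidirectional=True)
        # In the hybrid case the output layer predicts tied HMM states;
        # an attention decoder would instead attend over the encoder states.
        self.output = nn.Linear(2 * hidden_dim, num_outputs)

    def forward(self, features):
        # features: (time, batch, feat_dim)
        encoded, _ = self.blstm(features)
        return self.output(encoded)
```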
We present a complete training pipeline to build a state-of-the-art hybrid HMM-based ASR system on the 2nd release of the TED-LIUM corpus. Data augmentation using SpecAugment is successfully applied to improve performance on top of our best SAT model using i-vectors. By investigating the effect of different masking schemes, we achieve improvements from SpecAugment on hybrid HMM models without increasing model size or training time. A subsequent sMBR training is applied to fine-tune the final acoustic model, and both LSTM and Transformer language models are trained and evaluated. Our best system achieves a 5.6% WER on the test set, outperforming the previous state of the art by 27% relative.
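As a rough illustration of the masking involved, here is a minimal NumPy sketch of SpecAugment-style frequency and time masking on a log-mel spectrogram; the mask counts and widths are placeholder hyperparameters, not the tuned settings from the paper:

```python
import numpy as np

def spec_augment(spectrogram, num_freq_masks=2, freq_mask_width=8,
                 num_time_masks=2, time_mask_width=20, rng=None):
    """Apply frequency and time masking to a (time, freq) log-mel
    spectrogram. Mask counts/widths are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    x = spectrogram.copy()
    num_frames, num_bins = x.shape

    # Frequency masking: zero out a random band of consecutive mel bins.
    for _ in range(num_freq_masks):
        width = rng.integers(0, freq_mask_width + 1)
        start = rng.integers(0, max(1, num_bins - width))
        x[:, start:start + width] = 0.0

    # Time masking: zero out a random run of consecutive frames.
    for _ in range(num_time_masks):
        width = rng.integers(0, time_mask_width + 1)
        start = rng.integers(0, max(1, num_frames - width))
        x[start:start + width, :] = 0.0

    return x
```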
This paper addresses the robust speech recognition problem as an adaptation task. Specifically, we investigate the cumulative application of adaptation methods. A bidirectional Long Short-Term Memory (BLSTM) based neural network, capable of learning temporal relationships and translation-invariant representations, is used for robust acoustic modeling. Further, i-vectors were used as an input to the neural network to perform instantaneous speaker and environment adaptation, providing an 8% relative improvement in word error rate on the NIST Hub5 2000 evaluation test set. By enhancing the first-pass i-vector based adaptation with a second-pass adaptation using speaker- and environment-dependent transformations within the network, a further relative improvement of 5% in word error rate was achieved. We have re-evaluated the features used to estimate i-vectors and their normalization to achieve the best performance in a modern large-scale automatic speech recognition system.
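A minimal sketch of the first-pass i-vector adaptation idea: the fixed-dimensional utterance-level i-vector is appended to every acoustic frame, so the network receives speaker/environment information at each time step. The dimensions below (40-dim filterbank frames, 100-dim i-vector) are illustrative assumptions:

```python
import numpy as np

def append_ivector(features, ivector):
    """Concatenate a fixed i-vector to every acoustic frame, giving
    the network (feat_dim + ivec_dim)-dimensional inputs."""
    num_frames = features.shape[0]
    tiled = np.tile(ivector[None, :], (num_frames, 1))
    return np.concatenate([features, tiled], axis=1)

# e.g. 40-dim frames plus a 100-dim i-vector -> 140-dim network input
frames = np.random.randn(300, 40).astype(np.float32)
ivec = np.random.randn(100).astype(np.float32)
augmented = append_ivector(frames, ivec)  # shape (300, 140)
```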
Bidirectional Long Short-Term Memory (BLSTM) recurrent neural network (RNN) acoustic models have demonstrated superior performance over deep feed-forward neural network (DNN) models in speech recognition and many other tasks. Although a lot of work has been reported on DNN model adaptation, very little has been done on BLSTM model adaptation. This work presents a systematic study of the adaptation of BLSTM acoustic models by means of learning affine transformations within the neural network on small amounts of unsupervised adaptation data. Through a series of experiments on two major speech recognition benchmarks (Switchboard and CHiME-4), we investigate the significance of the position of the transformation in a BLSTM network, using separate transformations for the forward and backward directions. We observe that applying affine transformations results in consistent relative word error rate reductions ranging from 6% to 11%, depending on the task and the degree of mismatch between training and test data.
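A minimal PyTorch sketch of this adaptation idea, assuming the transform sits on top of a BLSTM layer's output: separate affine transforms for the forward and backward streams, initialized to identity so the unadapted model is unchanged, and then trained on the adaptation data (typically with the rest of the network frozen). The placement and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DirectionalAffine(nn.Module):
    """Adaptation affine transform applied separately to the forward
    and backward output streams of a BLSTM layer."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.fwd = nn.Linear(hidden_dim, hidden_dim)
        self.bwd = nn.Linear(hidden_dim, hidden_dim)
        # Identity init: before adaptation, the layer is a no-op.
        for layer in (self.fwd, self.bwd):
            nn.init.eye_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, blstm_out):
        # blstm_out: (time, batch, 2 * hidden_dim); the two halves are
        # the forward- and backward-direction outputs.
        h = blstm_out.size(-1) // 2
        fwd, bwd = blstm_out[..., :h], blstm_out[..., h:]
        return torch.cat([self.fwd(fwd), self.bwd(bwd)], dim=-1)
```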