We present state-of-the-art automatic speech recognition (ASR) systems for the LibriSpeech task, comparing a standard hybrid DNN/HMM architecture with an attention-based encoder-decoder design. Detailed descriptions of the system development, including model design, pretraining schemes, training schedules, and optimization approaches, are provided for both architectures. Both the hybrid DNN/HMM and the attention-based systems employ bi-directional LSTMs for acoustic modeling/encoding. For language modeling, we employ both LSTM- and Transformer-based architectures. All our systems are built using RWTH's open-source toolkits RASR and RETURNN. To the best of the authors' knowledge, the results obtained when training on the full LibriSpeech training set are the best published to date, both for the hybrid DNN/HMM and the attention-based systems. Our single hybrid system even outperforms previous results obtained by combining eight single systems. Our comparison shows that on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the attention-based system by 15% relative on the clean and 40% relative on the other test sets in terms of word error rate. Moreover, experiments on a reduced 100h subset of the LibriSpeech training corpus show an even more pronounced margin between the hybrid DNN/HMM and attention-based architectures.
We formulate a generalized hybrid HMM-NN training procedure using the full sum over the hidden state sequence and identify CTC as a special case of it. We present an analysis of the alignment behavior of such a training procedure and explain the strong localization of label outputs under full-sum training (also referred to as peaky or spiky behavior). We show how to avoid this behavior by using a state prior. We discuss the temporal decoupling between the output label position/time frame and the corresponding evidence in the input observations when training with BLSTM models, and we show a way to overcome this by jointly training an FFNN. We implemented the Baum-Welch alignment algorithm in CUDA to enable fast soft realignments on the GPU. We have published this code, along with some of our experiments, as part of RETURNN, RWTH's extensible training framework for universal recurrent neural networks. We conclude with an experimental validation of our study on WSJ and Switchboard.
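As an illustration of the full-sum objective described above, the following sketch computes Baum-Welch occupancy posteriors (the soft alignment) for a strictly left-to-right state sequence with NumPy, including an optional state-prior rescaling of the kind used to counteract peaky behavior. This is a simplified CPU reference, not the CUDA implementation shipped with RETURNN; the function name and the `alpha` prior scale are our own illustrative choices.

```python
import numpy as np

def fullsum_posteriors(log_probs, log_prior=None, alpha=0.0):
    """Baum-Welch occupancy posteriors (soft alignment) for a strictly
    left-to-right state sequence: at each frame we either stay in
    state s or advance to state s + 1.

    log_probs: (T, S) frame-wise log-scores for the S states of the
    expanded label sequence.  If a state log-prior is given, scores
    are rescaled by alpha * log_prior (the prior-based counterweight
    against peaky alignments).  Returns (T, S) posteriors; each row
    sums to one over the reachable states.
    """
    lp = log_probs if log_prior is None else log_probs - alpha * log_prior
    T, S = lp.shape
    fwd = np.full((T, S), -np.inf)
    bwd = np.full((T, S), -np.inf)
    fwd[0, 0] = lp[0, 0]
    for t in range(1, T):          # forward pass: sum over predecessors
        for s in range(S):
            prev = np.logaddexp(fwd[t - 1, s],
                                fwd[t - 1, s - 1] if s > 0 else -np.inf)
            fwd[t, s] = prev + lp[t, s]
    bwd[T - 1, S - 1] = 0.0
    for t in range(T - 2, -1, -1):  # backward pass: sum over successors
        for s in range(S):
            stay = bwd[t + 1, s] + lp[t + 1, s]
            adv = (bwd[t + 1, s + 1] + lp[t + 1, s + 1]
                   if s < S - 1 else -np.inf)
            bwd[t, s] = np.logaddexp(stay, adv)
    total = fwd[T - 1, S - 1]      # full-sum log-likelihood
    return np.exp(fwd + bwd - total)
```

In full-sum training these posteriors serve as the soft targets of the realignment step; the production version runs the same two passes as batched CUDA kernels.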
By default, statistical classification/multiple hypothesis testing is faced with the model mismatch introduced by replacing the true distributions in Bayes decision rule with model distributions estimated on training samples. Although a large number of statistical measures exist w.r.t. the mismatch introduced, these works rarely relate to the mismatch in accuracy, i.e. the difference between the model error and the Bayes error. In this work, the accuracy mismatch between the ideal Bayes decision rule/Bayes test and a mismatched decision rule in statistical classification/multiple hypothesis testing is investigated explicitly. A proof of a novel generalized tight statistical bound on the accuracy mismatch is presented. This result is compared to existing statistical bounds related to the total variation distance that can be extended to bounds on the accuracy mismatch. The analytic results are supported by distribution simulations.
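To make the relation between accuracy mismatch and total variation concrete, the following sketch checks by simulation, on random discrete joint distributions, that the excess error (regret) of a mismatched decision rule never exceeds the classical total-variation bound 2·TV(p, q). Note that this only exercises the well-known bound, not the tighter generalized bound derived in the paper, and the function name is our own.

```python
import numpy as np

def errors_and_bound(p, q):
    """p, q: (X, C) joint distributions over observation x and class c.
    Returns the Bayes error under p, the error under p of the decision
    rule that maximizes the model q, and the bound 2 * TV(p, q)."""
    bayes_err = 1.0 - p.max(axis=1).sum()
    rule = q.argmax(axis=1)                         # mismatched rule
    model_err = 1.0 - p[np.arange(len(p)), rule].sum()
    return bayes_err, model_err, np.abs(p - q).sum()  # = 2 * TV(p, q)

# simulate random discrete joints: 0 <= regret <= 2 * TV must hold
rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.random((8, 3)); p /= p.sum()
    q = rng.random((8, 3)); q /= q.sum()
    be, me, tv = errors_and_bound(p, q)
    assert -1e-12 <= me - be <= tv + 1e-12
```

The inequality follows from splitting max_c p(x,c) − p(x, ĉ(x)) into three differences, each bounded by |p − q| terms; the paper's contribution is a tighter, generalized version of bounds of this type.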
Training and testing the many possible parameter settings or model architectures of a state-of-the-art machine translation or automatic speech recognition system is a cumbersome task. It usually requires a long pipeline of commands, from pre-processing the training data to post-processing and evaluating the output. This paper introduces Sisyphus, a tool that aims at managing scientific experiments efficiently. After the workflow for a given task has been defined, Sisyphus runs all required steps and ensures that all commands finish successfully. It avoids unnecessary computation by reusing tasks that are needed in multiple parts of the workflow, and it saves the user time by determining the order in which tasks are to be performed. Since the program and workflow are written in Python, they can easily be extended with arbitrary code, which makes it possible to use the rich collection of Python tools for editing, debugging, and documentation. Sisyphus places only few requirements on the underlying server or cluster, has been successfully tested in many large-scale setups, and can handle thousands of tasks inside a workflow.
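The dependency tracking and result reuse described above can be illustrated with a toy scheduler. Note that this is not the actual Sisyphus API; all names below are hypothetical, and the sketch only shows the core idea of running every task once, after its dependencies, with shared results cached.

```python
# NOTE: hypothetical mini-scheduler, not the actual Sisyphus API.

class Task:
    """A named unit of work with explicit dependencies on other tasks."""
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, tuple(deps)

def run(targets, cache=None):
    """Resolve dependencies depth-first so every task runs exactly once,
    after all of its dependencies, with shared results reused."""
    cache = {} if cache is None else cache
    def resolve(task):
        if task.name not in cache:
            inputs = [resolve(dep) for dep in task.deps]
            cache[task.name] = task.fn(*inputs)
        return cache[task.name]
    return [resolve(t) for t in targets]

# a preprocessing step shared by two evaluation branches
calls = {"prep": 0}
def prep():
    calls["prep"] += 1
    return "features"

prep_task = Task("prep", prep)
eval_clean = Task("eval_clean", lambda f: f + ":clean", deps=[prep_task])
eval_other = Task("eval_other", lambda f: f + ":other", deps=[prep_task])
results = run([eval_clean, eval_other])  # prep executes only once
```

The real tool additionally persists task outputs on disk, submits jobs to a cluster, and monitors their completion, but the caching-over-a-DAG principle is the same.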
It has long been known that the classic hidden Markov model (HMM) derivation for speech recognition contains assumptions, such as the independence of observation vectors and weak duration modeling, that are practical but unrealistic. The hybrid approach amplifies this by fitting a discriminative model into a generative one. Hidden conditional random fields (CRFs) and segmental models (e.g. semi-Markov CRFs/segmental CRFs) have been proposed as an alternative, but for a long time had failed to gain traction until recently. In this paper we explore different length modeling approaches for segmental models and their relation to attention-based systems. Furthermore, we show experimental results on a handwriting recognition task and, to the best of our knowledge, the first reported results on the Switchboard 300h speech recognition corpus using this approach.
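A minimal sketch of the segmental (semi-Markov) dynamic program with an explicit length model may make the idea concrete. It assumes, purely for illustration, that a segment's score is the sum of per-frame label scores; all names are our own and the length models compared in the paper are richer than the single table used here.

```python
import numpy as np

def segmental_viterbi(frame_scores, labels, len_logprob, max_len):
    """Semi-Markov (segmental) Viterbi: each of the N labels covers one
    variable-length segment of frames, scored with an explicit length
    model len_logprob[l] for segments of l in 1..max_len frames.

    frame_scores: (T, V) per-frame label log-scores; a segment's score
    is the sum over its frames (a simplifying assumption of this
    sketch).  Returns the best total log-score over all segmentations.
    """
    T, V = frame_scores.shape
    N = len(labels)
    # cum[t, v] = sum of frame_scores[:t, v], for O(1) segment sums
    cum = np.vstack([np.zeros((1, V)), np.cumsum(frame_scores, axis=0)])
    # D[t, n] = best score covering the first t frames with n labels
    D = np.full((T + 1, N + 1), -np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        v = labels[n - 1]
        for t in range(n, T + 1):
            for l in range(1, min(max_len, t) + 1):
                cand = (D[t - l, n - 1]
                        + cum[t, v] - cum[t - l, v]
                        + len_logprob[l])
                if cand > D[t, n]:
                    D[t, n] = cand
    return D[T, N]
```

Replacing the max over segment lengths with a (log-)sum turns this into the corresponding full-sum objective, and the explicit dependence on l is where the different length modeling approaches enter.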