We propose a new approach to the problem of estimating the hyperparameters which define the inter-speaker variability model in joint factor analysis. We tested the proposed estimation technique on the NIST 2006 speaker recognition evaluation data and obtained 10-15% reductions in error rates on the core condition and the extended data condition (as measured both by equal error rates and the NIST detection cost function). We show that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types. (The comparisons are based on the best results on these tasks that have been reported in the literature.) In the case of the cross-channel condition, a factor analysis model with 300 speaker factors and 200 channel factors can achieve equal error rates of less than 3.0%. This is a substantial improvement over the best results that have previously been reported on this task.
State-of-the-art speaker recognition systems are based on the i-vector representation of speech segments. In this paper we show how this representation can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system, and we report excellent results on a French language audio transcription task. The implementation is very simple. An audio file is first diarized and each speaker cluster is represented by an i-vector. Acoustic feature vectors are augmented by the corresponding i-vectors before being presented to the DNN. (The same i-vector is used for all acoustic feature vectors aligned with a given speaker.) This supplementary information improves the DNN's ability to discriminate between phonetic events in a speaker-independent way without having to make any modification to the DNN training algorithms. We report results on the ETAPE 2011 transcription task, and show that i-vector based speaker adaptation is effective irrespective of whether cross-entropy or sequence training is used. For cross-entropy training, we obtained a word error rate (WER) reduction from 22.16% to 20.67%, whereas for sequence training the WER is reduced from 19.93% to 18.40%.
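The augmentation step described in this abstract can be illustrated with a minimal sketch. It assumes diarization has already assigned a speaker label to every frame and that one i-vector per speaker cluster is available. The function name `augment_with_ivectors`, the speaker labels, and the toy dimensions are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Sequence


def augment_with_ivectors(
    frames: Sequence[Sequence[float]],
    frame_speakers: Sequence[str],
    speaker_ivectors: Dict[str, Sequence[float]],
) -> List[List[float]]:
    """Append each frame's speaker i-vector to its acoustic feature vector.

    Hypothetical helper: every frame aligned with a given speaker
    receives the same i-vector, as the abstract describes.
    """
    if len(frames) != len(frame_speakers):
        raise ValueError("need exactly one speaker label per frame")
    augmented = []
    for frame, speaker in zip(frames, frame_speakers):
        # Concatenate acoustic features with the speaker-level i-vector.
        augmented.append(list(frame) + list(speaker_ivectors[speaker]))
    return augmented


# Toy data (assumed values): three 3-dimensional frames, two speakers,
# and 2-dimensional i-vectors. Real systems use far larger dimensions.
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
frame_speakers = ["spk_a", "spk_a", "spk_b"]
speaker_ivectors = {"spk_a": [1.0, -1.0], "spk_b": [0.5, 0.25]}

dnn_inputs = augment_with_ivectors(frames, frame_speakers, speaker_ivectors)
```

Because the speaker information enters only through the input vector, a sketch like this leaves the network and its training procedure untouched, which matches the abstract's point that no modification to the DNN training algorithms is needed.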
This paper presents a new class of A* algorithms for Viterbi phonetic decoding subject to lexical constraints. This type of algorithm can be made to run substantially faster than the Viterbi algorithm in an isolated word recognizer having a vocabulary of 1600 words. In addition, multiple recognition hypotheses can be generated on demand and the search can be constrained to respect conditions on phone durations in such a way that computational requirements are substantially reduced. Results are presented on a 60 000 word recognition task.
In this study we demonstrate the effectiveness of phonemic hidden Markov models with Gaussian mixture output densities (mixture HMM's) for speaker-dependent large-vocabulary word recognition. Speech recognition experiments show that for almost any reasonable amount of training data, recognizers using mixture HMM's consistently outperform those employing unimodal Gaussian HMM's. With a sufficiently large training set (e.g., more than 2500 words), use of HMM's with 25-component mixture distributions typically reduces recognition errors by about 40%. We also found that the mixture HMM's outperform a set of unimodal generalized triphone models having the same number of parameters. Previous attempts to employ mixture HMM's for speech recognition proved discouraging because of the high complexity and computational cost in implementing the Baum-Welch training algorithm. We show how mixture HMM's can be implemented very simply in the unimodal transition-based frameworks by the device of allowing multiple transitions from one state to another.
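The multiple-transition device mentioned at the end of this abstract can be illustrated with a small numerical sketch. It rests on an assumption the abstract does not spell out: output densities are attached to transitions, and a transition with a K-component mixture is replaced by K parallel transitions, each carrying one Gaussian and a probability equal to the original transition probability times the mixture weight. All function names and numbers below are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_transition_score(x, trans_prob, weights, means, variances):
    """Score of one transition whose output density is a Gaussian mixture."""
    mixture = sum(
        w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)
    )
    return trans_prob * mixture


def parallel_transition_score(x, trans_prob, weights, means, variances):
    """Summed score of parallel unimodal transitions between the same states.

    Assumed construction: transition k has probability trans_prob * weights[k]
    and a single Gaussian output density.
    """
    return sum(
        (trans_prob * w) * gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Toy values (assumed): a 3-component mixture on one transition.
weights = [0.5, 0.3, 0.2]
means = [0.0, 1.5, -2.0]
variances = [1.0, 0.5, 2.0]
trans_prob = 0.4
observation = 0.7

mixture_score = mixture_transition_score(
    observation, trans_prob, weights, means, variances
)
parallel_score = parallel_transition_score(
    observation, trans_prob, weights, means, variances
)
```

Under these assumptions the two scores agree up to floating-point rounding, so a forward pass that sums over parallel transitions yields the same likelihood as an explicit mixture density, and unimodal training code can be reused unchanged.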