In the tandem approach to modeling the acoustic signal, a neural-net preprocessor is first discriminatively trained to estimate posterior probabilities across a phone set. These posteriors are then used as feature inputs for a conventional speech recognizer based on hidden Markov models (HMMs), which relearns the associations to subword units. In this paper, we apply the tandem approach to the data provided for the first Speech in Noisy Environments (SPINE1) evaluation conducted by the Naval Research Laboratory (NRL) in August 2000. In our previous experience with the ETSI Aurora noisy digits (a small-vocabulary, high-noise task), the tandem approach achieved error-rate reductions of over 50% relative to the HMM baseline. For SPINE1, a larger task involving more spontaneous speech, we find that, when context-independent models are used, the tandem features continue to yield large reductions in word-error rate relative to systems using standard MFC or PLP features. However, these improvements do not carry over to context-dependent models. This may be attributable to several factors, which are discussed in the paper.
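As a rough illustration of the feature path described above, the following minimal sketch turns per-frame network outputs into phone posteriors and then into feature vectors that an HMM system could model. It is not the system used in the paper: the function names, the example activations, and the choice to take logarithms of the posteriors (one common way of making them better suited to Gaussian modeling) are assumptions made for illustration only, and any further processing such as decorrelation is omitted.

```python
import math


def softmax(activations):
    """Convert one frame's raw network outputs into phone posteriors."""
    peak = max(activations)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def tandem_features(frame_activations, floor=1e-10):
    """Map each frame's activations to a log-posterior feature vector.

    Assumption: the log is applied here purely as an illustrative
    transform; the floor guards against log(0).
    """
    features = []
    for activations in frame_activations:
        posteriors = softmax(activations)
        features.append([math.log(max(p, floor)) for p in posteriors])
    return features


# Hypothetical activations: 2 frames, 3 phone classes.
frames = [[2.0, 0.5, -1.0], [0.1, 0.1, 3.0]]
feats = tandem_features(frames)
```

Each row of `feats` would then stand in for the MFC or PLP vector of that frame when training and decoding with the HMM recognizer.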