Most speech recognition systems try to reconstruct a word sequence given an acoustic input, using prior information about the language being spoken. In some cases, more information is available to the decoder than the acoustics alone. When decoding a television news broadcast, for example, the closed-caption information that is often recorded for hearing-impaired viewers may also be available. While these captions are generally not completely accurate transcriptions, they can be considered a strong hint as to what was actually spoken. In this paper, we present a formalization of this problem in terms of the source-channel paradigm. We propose a simple translation model for mapping caption sequences to word sequences, which updates the language model with the prior information inherent in the captions. We also describe an efficient implementation of the search in a Viterbi decoder, and present results using this system in the broadcast news domain.
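As a rough illustration of the idea of updating a language model with caption priors, the sketch below interpolates a base unigram model with a distribution estimated from the caption text. This is a hypothetical simplification for intuition only; the paper's actual translation model and its integration with the decoder are more detailed.

```python
from collections import Counter

def caption_biased_lm(base_lm, caption, alpha=0.5):
    """Interpolate a base unigram LM with a unigram distribution
    estimated from the caption text. `alpha` is the weight given
    to the caption-derived prior (illustrative only)."""
    counts = Counter(caption.split())
    total = sum(counts.values())
    caption_lm = {w: c / total for w, c in counts.items()}
    vocab = set(base_lm) | set(caption_lm)
    return {w: alpha * caption_lm.get(w, 0.0)
               + (1 - alpha) * base_lm.get(w, 0.0)
            for w in vocab}

base = {"the": 0.5, "news": 0.3, "weather": 0.2}
biased = caption_biased_lm(base, "the news tonight", alpha=0.5)
```

Words appearing in the caption (here "news" and "tonight") gain probability mass relative to the base model, which is the sense in which the captions act as a strong hint to the decoder.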
We describe four different ways in which we use the N-Best paradigm within the BYBLOS system. The most obvious use is for the efficient integration of speech recognition with a linguistic natural language understanding module. However, we have extended this principle to several other acoustic knowledge sources. We also describe a simple and efficient means for investigating and incorporating arbitrary new knowledge sources. The N-Best hypotheses are used to provide close alternatives for discriminative training. Finally, we have developed a simple technique that allows us to optimize several weights and parameters within a system in a way that directly minimizes word error rate. Examples of each of these uses within the BYBLOS system are described.
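The core mechanism behind incorporating a new knowledge source via the N-Best paradigm can be sketched as a rescoring step: each hypothesis in the list receives an additional weighted score from the new source, and the list is re-ranked. The function names, scores, and weight below are hypothetical; BYBLOS' actual score combination and weight optimization are more involved.

```python
def rescore_nbest(hypotheses, new_score_fn, weight=1.0):
    """Re-rank an N-best list of (text, score) pairs by adding a
    weighted score from an additional knowledge source to each
    hypothesis' original score (illustrative sketch)."""
    rescored = [(text, score + weight * new_score_fn(text))
                for text, score in hypotheses]
    # Higher combined score is better; sort descending.
    return sorted(rescored, key=lambda p: p[1], reverse=True)

# Toy example: two hypotheses with log-domain recognizer scores,
# and a knowledge source that rewards the word "flights".
nbest = [("show me flights", -10.0), ("show me fights", -9.5)]
reranked = rescore_nbest(nbest,
                         lambda t: 1.0 if "flights" in t else 0.0,
                         weight=2.0)
```

Because the knowledge source only needs to score whole word sequences, arbitrary new sources can be plugged in without modifying the first-pass decoder, which is what makes the paradigm convenient for experimentation.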
This paper presents speech recognition test results from the BBN BYBLOS system on the Feb 91 DARPA benchmarks in both the Resource Management (RM) and the Air Travel Information System (ATIS) domains. In the RM test, we report on speaker-independent (SI) recognition performance for the standard training condition using 109 speakers and for our recently proposed SI model made from only 12 training speakers. Surprisingly, the 12-speaker model performs as well as the one made from 109 speakers. Also within the RM domain, we demonstrate that state-of-the-art SI models perform poorly for speakers with strong dialects. But we show that this degradation can be overcome by using speaker adaptation from multiple reference speakers. For the ATIS benchmarks, we ran a new system configuration which first produced a rank-ordered list of the N-best word-sequence hypotheses. The list of hypotheses was then reordered using more detailed acoustic and language models. In the ATIS benchmarks, we report SI recognition results on two conditions. The first is a baseline condition using only training data available from NIST on CD-ROM and a word-based statistical bigram grammar developed at MIT/Lincoln. In the second condition, we added training data from speakers collected at BBN and used a 4-gram class grammar. These changes reduced the word error rate by 25%.