Abstract-The current state of the art in large-vocabulary, continuous speech recognition is based on the use of hidden Markov models (HMM). In an attempt to improve over HMM performance, we developed a hybrid system that combines the advantages of neural networks and HMM using a multiple-hypothesis (or N-best) paradigm. The connectionist component of the system, the segmental neural net (SNN), models all the frames of a phonetic segment simultaneously, thus overcoming the well-known conditional-independence limitation of HMMs. In this paper, we describe the hybrid system and discuss various aspects of SNN modeling, including network architectures, training algorithms, and context modeling. Finally, we evaluate the hybrid system in several speaker-independent experiments with the DARPA Resource Management (RM) corpus, and we demonstrate that the hybrid system shows a consistent improvement in performance over the baseline HMM system.
I. INTRODUCTION
Continuous speech recognition (CSR) is in principle a massive search procedure among all possible sentences (word sequences) allowed by the vocabulary and the grammar, and all possible alignments of each sentence with the input speech, to find the sentence and the alignment that are the most likely given the input speech. Because of the variability of individual words in time and the fact that words are connected, a successful CSR system must deal with many interrelated aspects of time modeling: the variability of parameters within a speech unit (such as a word or a phoneme) as a function of time, the variability that is introduced by the acoustic dependence of one phoneme on neighboring phonemes, and the need to model the input speech as a sequence of concatenated speech units, allowing for different possible alignments or segmentations. Finally, for each possible alignment between a hypothesized sequence of speech units and the input speech, the system must produce a global match score to indicate how closely it matches the input speech. The global score consists of a combination of the scores of the individual speech units.
State-of-the-art CSR systems are based on the use of hidden Markov models (HMM) to model phonemes in context. Fig. 1 gives a summary of the basic functioning of an HMM and how it deals with the time modeling problems stated above.¹
The authors are with BBN Systems and Technologies, Cambridge, MA 02138. IEEE Log Number 9214815.
¹An extensive description of HMM is beyond the scope of this paper. The interested reader may refer to [1] for a tutorial, and to [2] for a more detailed presentation. An easily readable overview of an HMM-based speech recognition system can be found in [3].
Fig. 1. Basic functioning of a hidden Markov model (HMM). Model: The phonetic models use a number of states (three in this example) and left-to-right transitions. Associated with each transition is the probability of moving from one state to another. Associated with each state is a probability density function (pdf) of the spectral features, p(x|state), f...
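The left-to-right model summarized in the Fig. 1 caption can be illustrated with a small sketch. It is not the BYBLOS implementation: it assumes a single scalar feature per frame, one Gaussian pdf per state, and a best-path (Viterbi) search over alignments. The function names and all parameter values are hypothetical and chosen only for illustration.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian (assumed stand-in for p(x|state))."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_left_to_right(frames, means, variances, log_self, log_next):
    """Best alignment of frames to a left-to-right HMM.

    log_self[s] is the log probability of staying in state s, and
    log_next[s] the log probability of moving from state s to s + 1.
    The path must start in the first state and end in the last one.
    Returns (log score, state index per frame).
    """
    n_states = len(means)
    neg_inf = float("-inf")
    # score[s]: best log score of any partial path ending in state s
    score = [neg_inf] * n_states
    score[0] = log_gauss(frames[0], means[0], variances[0])
    backpointers = []
    for x in frames[1:]:
        new_score = [neg_inf] * n_states
        ptr = [0] * n_states
        for s in range(n_states):
            stay = score[s] + log_self[s]
            move = score[s - 1] + log_next[s - 1] if s > 0 else neg_inf
            if stay >= move:
                best, ptr[s] = stay, s
            else:
                best, ptr[s] = move, s - 1
            new_score[s] = best + log_gauss(x, means[s], variances[s])
        score = new_score
        backpointers.append(ptr)
    # Trace the best path back from the final state.
    state = n_states - 1
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return score[-1], path

# Hypothetical three-state phonetic model and six frames of a scalar feature.
half = math.log(0.5)
score, path = viterbi_left_to_right(
    frames=[0.0, 0.1, 1.0, 1.1, 2.0, 2.1],
    means=[0.0, 1.0, 2.0],
    variances=[0.1, 0.1, 0.1],
    log_self=[half, half, half],
    log_next=[half, half],
)
print(path, score)
```

The returned score is the sum of per-frame log densities and transition log probabilities along the best path, which is one way of forming a global match score from the scores of individual units. If there are fewer frames than states, no valid path exists and the score is negative infinity.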
We describe four different ways in which we use the N-Best paradigm within the BYBLOS system. The most obvious use is for the efficient integration of speech recognition with a linguistic natural language understanding module. However, we have extended this principle to several other acoustic knowledge sources. We also describe a simple and efficient means for investigating and incorporating arbitrary new knowledge sources. The N-Best hypotheses are used to provide close alternatives for discriminative training. Finally, we have developed a simple technique that allows us to optimize several weights and parameters within a system in a way that directly minimizes word error rate. Examples of each of these uses within the BYBLOS system are described.
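The rescoring and weight-tuning ideas above can be sketched as follows. This is a minimal illustration, not the BYBLOS procedure: each hypothesis is assumed to carry one log score per knowledge source (here, hypothetically, an HMM score and an SNN score), the combined score is assumed to be a weighted sum, and a plain grid search stands in for whatever optimizer the authors used. All names and numbers are made up for the example.

```python
from itertools import product

def word_errors(ref, hyp):
    """Word-level edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def rescore(nbest, weights):
    """Pick the hypothesis with the highest weighted sum of its scores."""
    return max(nbest,
               key=lambda h: sum(w * s for w, s in zip(weights, h["scores"])))

def tune_weights(utterances, grid, n_sources):
    """Grid search for the weights that give the fewest total word errors."""
    best_errors, best_weights = None, None
    for weights in product(grid, repeat=n_sources):
        errors = sum(
            word_errors(u["ref"], rescore(u["nbest"], weights)["words"])
            for u in utterances
        )
        if best_errors is None or errors < best_errors:
            best_errors, best_weights = errors, weights
    return best_errors, best_weights

# Hypothetical tuning set: one utterance with a two-entry N-best list.
# Scores are (HMM log score, SNN log score); the values are invented.
utterances = [{
    "ref": ["show", "all", "ships"],
    "nbest": [
        {"words": ["show", "all", "chips"], "scores": (-10.0, -20.0)},
        {"words": ["show", "all", "ships"], "scores": (-11.0, -12.0)},
    ],
}]

errors, weights = tune_weights(utterances, grid=[0.0, 0.5, 1.0], n_sources=2)
print(errors, weights)
```

Because the search only reorders a short list of hypotheses, a new knowledge source can be tried by appending one more score per hypothesis and one more weight, without rerunning the recognizer.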