Keyword detection in conversational speech utterances using hidden Markov model based continuous speech recognition

Rose, Richard C.

doi:10.1006/csla.1995.0015

Cited by 42 publications

(15 citation statements)

References 26 publications

(43 reference statements)

“…Note that the gray-shaded arrow in Fig. 4 pointing from $q^{tr}_{t-1}$ to $q^{c}_{t}$ is only valid during the second training cycle when there are no segmentation constraints and will be ignored in Equation 5.…”

Section: Training (mentioning)

confidence: 99%

“…Since full spoken language understanding without any restriction of the expected vocabulary is hardly feasible and not necessarily needed in today's human-machine interaction scenarios (e. g. [4]), most systems apply keyword spotting as an alternative to large vocabulary continuous speech recognition. The aim of keyword spotting is to detect a set of predefined keywords from continuous speech signals [5]. When applied in human-like cognitive systems, keyword detectors have to process natural and spontaneous speech, which in contrast to well articulated read speech (as used in [6], for example) leads to comparatively low recognition rates [7].…”

Section: Introduction (mentioning)

confidence: 99%

Bidirectional LSTM Networks for Context-Sensitive Keyword Detection in a Cognitive Virtual Agent Framework

et al. 2010

Robustly detecting keywords in human speech is an important precondition for cognitive systems, which aim at intelligently interacting with users. Conventional techniques for keyword spotting usually show good performance when evaluated on well articulated read speech. However, modeling natural, spontaneous, and emotionally colored speech is challenging for today's speech recognition systems and thus requires novel approaches with enhanced robustness. In this article, we propose a new architecture for vocabulary independent keyword detection as needed for cognitive virtual agents such as the SEMAINE system. Our word spotting model is composed of a Dynamic Bayesian Network (DBN) and a bidirectional Long Short-Term Memory (BLSTM) recurrent neural net. The BLSTM network uses a self-learned amount of contextual information to provide a discrete phoneme prediction feature for the DBN, which is able to distinguish between keywords and arbitrary speech. We evaluate our Tandem BLSTM-DBN technique on both read speech and spontaneous emotional speech and show that our method significantly outperforms conventional Hidden Markov Model-based approaches for both application scenarios.

“…More reliable model estimation may be achieved by constructing keyword models as concatenations of phonetic HMMs. More recently, benefited from large vocabulary continuous speech recognition (LVCSR) techniques, a two-stage approach [8] is often shown to deliver good word-spotting results. In the first stage, the approach uses an LVCSR decoder to produce a set of hypothesized transcriptions, from which the presence of keywords are detected and verified in the second stage.…”

Section: Introduction (mentioning)

confidence: 99%
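The two-stage approach described in the statement above can be illustrated with a minimal sketch. It assumes the first stage (the LVCSR decoder) has already produced a list of hypothesized transcriptions, each with a posterior-like weight. The names `Hypothesis`, `spot_keywords`, and the `threshold` value are hypothetical and are not taken from the cited work; the second stage here is a simple weighted vote, not the verification method of any particular paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Hypothesis:
    words: List[str]  # one hypothesized transcription from the first stage
    score: float      # hypothetical posterior-like weight (assumption)


def spot_keywords(hypotheses: List[Hypothesis],
                  keywords: Iterable[str],
                  threshold: float = 0.5) -> Dict[str, float]:
    """Second stage: keep a keyword if the share of hypothesis weight
    that contains it reaches the threshold."""
    total = sum(h.score for h in hypotheses)
    detected = {}
    for kw in keywords:
        support = sum(h.score for h in hypotheses if kw in h.words)
        confidence = support / total if total > 0 else 0.0
        if confidence >= threshold:
            detected[kw] = confidence
    return detected


# Toy N-best list standing in for real decoder output.
nbest = [
    Hypothesis(["book", "a", "flight"], 0.6),
    Hypothesis(["look", "a", "flight"], 0.3),
    Hypothesis(["book", "a", "fight"], 0.1),
]
result = spot_keywords(nbest, ["book", "flight", "hotel"])
```

In this toy input, "book" and "flight" are supported by most of the hypothesis weight and are kept, while "hotel" never appears and is rejected.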

Discriminative training using non-uniform criteria for keyword spotting on spontaneous speech

Weng, Juang 2015. IEEE/ACM Trans. Audio Speech Lang. Process.

In this work, we formulate the problem of keyword spotting as a non-uniform error automatic speech recognition (ASR) problem and propose a model training methodology based on the non-uniform minimum classification error (MCE) approach. The main idea is to adapt the fundamental MCE criteria to reflect the cost-sensitive notion in that errors on keywords are much more significant than errors on non-keywords in an automatic speech recognition task. The notion of cost sensitivity leads to emphasis of keyword models in parameter optimization. Then we present a system which takes advantage of the weighted finite-state transducer (WFST) framework to efficiently implement the non-uniform MCE. To enhance the approach of non-uniform error cost minimization for keyword spotting, we further formulate a technique called "adaptive boosted non-uniform MCE" which incorporates the idea of boosting. We validate the proposed framework on two challenging large-scale spontaneous conversational telephone speech (CTS) datasets in two different languages (English and Mandarin). Experimental results show our framework can achieve consistent and significant spotting performance gains over both the maximum likelihood estimation (MLE) baseline and conventional discriminatively-trained systems with uniform error cost. Index Terms: discriminative training (DT), minimum classification error (MCE), non-uniform criteria, keyword spotting, weighted finite-state transducer (WFST).
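The cost-sensitive idea in this abstract can be sketched with the textbook sigmoid-smoothed MCE loss, scaled by a per-token error cost. This is only an illustration under stated assumptions: the function names, the `keyword_cost` and `other_cost` values, and the single-token setting are hypothetical, and the sketch does not reproduce the WFST-based implementation or the adaptive boosting described by the authors.

```python
import math
from typing import List


def mce_loss(scores: List[float], correct: int,
             eta: float = 1.0, gamma: float = 1.0) -> float:
    """Standard smoothed MCE loss for one token.

    scores  : discriminant scores, one per class
    correct : index of the reference class
    """
    competitors = [s for i, s in enumerate(scores) if i != correct]
    # Soft maximum over the competing classes.
    anti = math.log(
        sum(math.exp(eta * s) for s in competitors) / len(competitors)
    ) / eta
    d = -scores[correct] + anti  # misclassification measure
    return 1.0 / (1.0 + math.exp(-gamma * d))  # sigmoid smoothing


def non_uniform_mce_loss(scores: List[float], correct: int, is_keyword: bool,
                         keyword_cost: float = 5.0,
                         other_cost: float = 1.0) -> float:
    """Scale the MCE loss by an error cost that is larger for keywords.
    The cost values are hypothetical, chosen only for illustration."""
    cost = keyword_cost if is_keyword else other_cost
    return cost * mce_loss(scores, correct)


base = mce_loss([2.0, 0.0, 0.0], correct=0)
kw = non_uniform_mce_loss([2.0, 0.0, 0.0], correct=0, is_keyword=True)
non_kw = non_uniform_mce_loss([2.0, 0.0, 0.0], correct=0, is_keyword=False)
```

With identical scores, the keyword token contributes five times the loss of the non-keyword token, so gradient-based training would put more emphasis on the keyword models.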

“…As a more robust strategy, word spotting approaches [5], [6] have been studied. They are classified into two approaches in terms of the modeling of non-keyword parts.…”

mentioning

confidence: 99%

Flexible speech understanding based on combined key-phrase detection and verification

Kawahara, Lee, Juang 1998. IEEE Trans. Speech Audio Process.

We propose a novel speech understanding strategy based on combined detection and verification of semantically tagged key-phrases in spontaneous spoken utterances. Key-phrases are defined in a top-down manner so as to constitute semantic slots. Their detection directly leads to robust understanding. A phrase network realizes both a wide coverage and a reasonable constraint for detection. A subword-based verifier is then incorporated to reduce false alarms in detection and attach confidence measures of the detected phrases. This set of phrase confidence measures, when incorporated in a spoken dialogue system, forms a basis for designing intelligent speech interfaces that accept only verified key-phrases and reprompt users to clarify unspecified or unrecognized portions. Several forms of confidence measures based on subword-level tests are investigated. The proposed approach was tested on field data collected from real-world trial applications. The combined detection and verification strategy drastically improves the accuracy in handling out-of-grammar utterances over the conventional decoding approaches while maintaining the performance for in-grammar utterances.
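The accept-or-reprompt behavior described in this abstract can be sketched as follows. The abstract says several subword-level confidence measures were investigated but does not specify them here, so this sketch assumes one simple form: the mean of per-subword log-likelihood ratios. The names `phrase_confidence` and `verify_phrases`, the slot names, and the threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def phrase_confidence(subword_llrs: List[float]) -> float:
    """Assumed confidence measure: mean subword log-likelihood ratio."""
    return sum(subword_llrs) / len(subword_llrs)


def verify_phrases(detections: Dict[str, List[float]],
                   threshold: float = 0.0
                   ) -> Tuple[Dict[str, float], List[str]]:
    """Accept slots whose detected key-phrase passes verification;
    list the remaining slots so the dialogue system can reprompt."""
    accepted: Dict[str, float] = {}
    reprompt: List[str] = []
    for slot, llrs in detections.items():
        if llrs and phrase_confidence(llrs) >= threshold:
            accepted[slot] = phrase_confidence(llrs)
        else:
            reprompt.append(slot)
    return accepted, reprompt


# Toy detections: semantic slot -> subword-level test scores (made up).
detections = {
    "destination": [1.2, 0.8, 0.4],
    "date": [-0.5, 0.1, -0.9],
}
accepted, reprompt = verify_phrases(detections)
```

Here the "destination" phrase is accepted, while "date" falls below the threshold and would trigger a clarification prompt.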
