2014 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2014.7078631

Online word-spotting in continuous speech with recurrent neural networks

Abstract: In this paper we introduce a simplified architecture for gated recurrent neural networks that can be used in single-pass applications, where word-spotting needs to be done in real time and phoneme-level information is not available for training. The network operates as a self-contained block in a strictly forward-pass configuration to directly generate keyword labels. We call these simple networks causal networks, where the current output is weighted only by the past inputs and outputs. Since the basic net…
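The "causal" property described in the abstract can be sketched as a recurrent step whose output at each frame depends only on the current input and the previous state, so the network runs strictly forward over a stream. This is a minimal illustrative NumPy sketch, not the paper's actual gated architecture; all names and shapes are assumptions.

```python
import numpy as np

def causal_step(x_t, h_prev, W_x, W_h, b):
    """One strictly forward recurrent step: no future frames are consulted."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8                              # toy dimensions
W_x = rng.standard_normal((n_hid, n_in)) * 0.1
W_h = rng.standard_normal((n_hid, n_hid)) * 0.1
b = np.zeros(n_hid)

h = np.zeros(n_hid)
for x_t in rng.standard_normal((10, n_in)):     # simulated feature frames
    h = causal_step(x_t, h, W_x, W_h, b)        # online, frame by frame
```

Because each step needs only the previous state, keyword labels can be emitted in real time as frames arrive, which is the single-pass setting the abstract targets.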

Cited by 29 publications (18 citation statements). References 9 publications (12 reference statements).
“…What we saw is that our results in terms of keyword search quality fall in between those reported for Cantonese when GMMs are used in the acoustic model and are slightly worse when deep neural networks are used (MTWV 0.335 and 0.441, resp.). As for the real-time factor our results outperform those reported in [14], which may be attributed to a relatively small number of Gaussians we use per senone.…”
Section: Results (contrasting)
confidence: 44%
“…An average MTWV reported for these languages ranges from 0.22 for Zulu to 0.67 for Haitian Creole. In [14] the use of recurrent neural networks for example-based word spotting in real time for English is described. Compared to more widespread text-based systems, this approach makes use of spoken examples of a keyword to build up a word-based model and then do the search within speech data.…”
Section: Introduction (mentioning)
confidence: 99%
“…They have demonstrated efficiency in terms of inference speed and computational cost but fail at capturing large patterns with reasonably small models. Recent works have suggested RNN based keyword spotting using LSTM cells that can leverage longer temporal context using gating mechanism and internal states [7,8,9]. However, because RNNs may suffer from state saturation when facing continuous input streams [10], their internal state needs to be periodically reset.…”
Section: Introduction (mentioning)
confidence: 99%
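The state-saturation issue raised in the excerpt above is typically handled with a periodic state reset. The sketch below is illustrative only: the `stream_with_reset` helper, the toy update rule, and the reset interval are assumptions, not details from the cited works.

```python
import numpy as np

def stream_with_reset(frames, step, h0, reset_every=100):
    """Run a recurrent step over a continuous stream, zeroing the
    internal state every `reset_every` frames to avoid saturation."""
    h = h0.copy()
    outputs = []
    for t, x_t in enumerate(frames):
        if t > 0 and t % reset_every == 0:
            h = h0.copy()                      # periodic reset
        h = step(x_t, h)
        outputs.append(h.copy())
    return outputs

# Toy recurrent update: on constant input the state drifts toward
# saturation, which the periodic reset interrupts.
step = lambda x, h: np.tanh(0.9 * h + x)
frames = np.ones((250, 1))
outs = stream_with_reset(frames, step, np.zeros(1), reset_every=100)
```

Right after each reset the state is rebuilt from scratch, trading a brief loss of temporal context for bounded, non-saturated activations on an unbounded input stream.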
“…Alternative segment-level loss functions include different statistics of frame-level keyword posteriors within a keyword segment, e.g., the geometric mean. There is also literature on training LSTMs with Connectionist Temporal Classification (CTC) [14,15,16,23] for keyword spotting tasks. In addition, architectures that combine LSTMs and CNNs have been applied to different tasks [24,25].…”
Section: Max-pooling (mentioning)
confidence: 99%
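The segment-level statistics contrasted in the excerpt above can be sketched as follows: max-pooling scores a segment by its single most confident frame, while the geometric mean rewards posteriors that stay high across the whole segment. The function name and toy values are illustrative, not from the cited papers.

```python
import numpy as np

def segment_score(posteriors, mode="max"):
    """Collapse frame-level keyword posteriors into one segment score."""
    p = np.asarray(posteriors, dtype=float)
    if mode == "max":
        return float(p.max())                        # max-pooling
    if mode == "geomean":
        # log-domain mean for numerical stability near zero
        return float(np.exp(np.mean(np.log(p + 1e-12))))
    raise ValueError(f"unknown mode: {mode}")

frame_posteriors = [0.1, 0.2, 0.9, 0.3]              # toy keyword segment
peak = segment_score(frame_posteriors, "max")        # 0.9
mean = segment_score(frame_posteriors, "geomean")    # ≈ 0.27
```

A single confident frame drives the max-pooled score up, whereas the geometric mean stays low unless most frames agree, which changes what the training loss encourages.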