2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017
DOI: 10.1109/asru.2017.8268974

Streaming small-footprint keyword spotting using sequence-to-sequence models

Abstract: We develop streaming keyword spotting systems using a recurrent neural network transducer (RNN-T) model: an all-neural, end-to-end trained, sequence-to-sequence model which jointly learns acoustic and language model components. Our models are trained to predict either phonemes or graphemes as subword units, thus allowing us to detect arbitrary keyword phrases, without any out-of-vocabulary words. In order to adapt the models to the requirements of keyword spotting, we propose a novel technique which biases the R…

Cited by 79 publications (67 citation statements)
References 43 publications

“…Due to the limited data access, direct result comparison with previous works became difficult. Nevertheless, we compared our results with others in Table 2 to show that the results are comparable to that of predefined KWS systems [3,5,4] and query-by-example system [13]. Blanks in the table implies unknown information.…”
Section: Fst Constrained By Phonectic Hypothesis (mentioning)
confidence: 84%
“…The query word, 'Hey Snips' is short and false alarms are more likely to occur. The performance is heavily influenced by the type of keyword and this result is also specified in [13].…”
Section: Fst Constrained By Phonectic Hypothesis (mentioning)
confidence: 95%
“…Amazon Alexa, Google Assistant, Apple Siri), spoken term classification does not have the low-latency constraint since the classification is done at utterance level. Previous works [16,17,18,19] showed that neural networks are very effective in keyword spotting. As tremendous efforts are dedicated into the discovery of effective CNN architectures for further advancing the performance, we argue that it is also important to investigate into effective ways for utilizing computational resource at inference time.…”
Section: Introduction (mentioning)
confidence: 99%
“…These methods have demonstrated computational efficiency but failed in capturing local receptive fields and short range context. Various attempts have also been made to build a KWS system with recurrent neural networks (RNNs) [15,16,17,18,19], which is capable of modeling longer temporal context information. However, RNNs may suffer from state saturation while facing continuous input stream, increasing computational cost and detection latency.…”
Section: Introduction (mentioning)
confidence: 99%