Streaming small-footprint keyword spotting using sequence-to-sequence models

He, Yanzhang; Prabhavalkar, Rohit; Rao, Kanishka; Li, Wei; Bakhtin, Anton; McGraw, Ian

doi:10.1109/asru.2017.8268974

Cited by 79 publications

(67 citation statements)

References 43 publications

“…Due to the limited data access, direct result comparison with previous works became difficult. Nevertheless, we compared our results with others in Table 2 to show that the results are comparable to that of predefined KWS systems [3,5,4] and query-by-example system [13]. Blanks in the table implies unknown information.…”

Section: FST Constrained by Phonetic Hypothesis (mentioning)

confidence: 84%

“…The query word, 'Hey Snips' is short and false alarms are more likely to occur. The performance is heavily influenced by the type of keyword and this result is also specified in [13].…”

Section: FST Constrained by Phonetic Hypothesis (mentioning)

confidence: 95%

“…Recently, end-to-end NN based query-by-example systems are suggested [13,14]. [13] uses a recurrent neural network transducer (RNN-T) model biased with attention over keyword. [14] suggests to use text query instead of audio.…”

Section: Introduction (mentioning)

confidence: 99%

Query-by-Example On-Device Keyword Spotting

Kim, Lee, Lee et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

A keyword spotting (KWS) system determines the existence of, usually predefined, keyword in a continuous speech stream. This paper presents a query-by-example on-device KWS system which is user-specific. The proposed system consists of two main steps: query enrollment and testing. In query enrollment step, phonetic posteriors are output by a small-footprint automatic speech recognition model based on connectionist temporal classification. Using the phonetic-level posteriorgram, hypothesis graph of finite-state transducer (FST) is built, thus can enroll any keywords thus avoiding an out-of-vocabulary problem. In testing, a log-likelihood is scored for input audio using the FST. We propose a threshold prediction method while using the user-specific keyword hypothesis only. The system generates query-specific negatives by rearranging each query utterance in waveform. The threshold is decided based on the enrollment queries and generated negatives. We tested two keywords in English, and the proposed work shows promising performance while preserving simplicity.

“…Amazon Alexa, Google Assistant, Apple Siri), spoken term classification does not have the low-latency constraint since the classification is done at utterance level. Previous works [16,17,18,19] showed that neural networks are very effective in keyword spotting. As tremendous efforts are dedicated into the discovery of effective CNN architectures for further advancing the performance, we argue that it is also important to investigate into effective ways for utilizing computational resource at inference time.…”

Section: Introduction (mentioning)

confidence: 99%

Sub-Band Convolutional Neural Networks for Small-Footprint Spoken Term Classification

Kao, Sun, Gao et al. 2019

Interspeech 2019

This paper proposes a Sub-band Convolutional Neural Network for spoken term classification. Convolutional neural networks (CNNs) have proven to be very effective in acoustic applications such as spoken term classification, keyword spotting, speaker identification, acoustic event detection, etc. Unlike applications in computer vision, the spatial invariance property of 2D convolutional kernels does not fit acoustic applications well since the meaning of a specific 2D kernel varies a lot along the feature axis in an input feature map. We propose a sub-band CNN architecture to apply different convolutional kernels on each feature sub-band, which makes the overall computation more efficient. Experimental results show that the computational efficiency brought by sub-band CNN is more beneficial for small-footprint models. Compared to a baseline full band CNN for spoken term classification on a publicly available Speech Commands dataset, the proposed sub-band CNN architecture reduces the computation by 39.7% on commands classification, and 49.3% on digits classification with accuracy maintained.

“…These methods have demonstrated computational efficiency but failed in capturing local receptive fields and short range context. Various attempts have also been made to build a KWS system with recurrent neural networks (RNNs) [15,16,17,18,19], which is capable of modeling longer temporal context information. However, RNNs may suffer from state saturation while facing continuous input stream, increasing computational cost and detection latency.…”

Section: Introduction (mentioning)

confidence: 99%

Small-Footprint Keyword Spotting with Graph Convolutional Network

Chen, Yin, Song et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Despite the recent successes of deep neural networks, it remains challenging to achieve high precision keyword spotting task (KWS) on resource-constrained devices. In this study, we propose a novel context-aware and compact architecture for keyword spotting task. Based on residual connection and bottleneck structure, we design a compact and efficient network for KWS task. To leverage the long range dependencies and global context of the convolutional feature maps, the graph convolutional network is introduced to encode the nonlocal relations. By evaluated on the Google Speech Command Dataset, the proposed method achieves state-of-the-art performance and outperforms the prior works by a large margin with lower computational cost.
