Interspeech 2016
DOI: 10.21437/interspeech.2016-1393
Model Compression Applied to Small-Footprint Keyword Spotting

Abstract: Several consumer speech devices feature voice interfaces that perform on-device keyword spotting to initiate user interactions. Accurate on-device keyword spotting within a tight CPU budget is crucial for such devices. Motivated by this, we investigated two ways to improve deep neural network (DNN) acoustic models for keyword spotting without increasing CPU usage. First, we used low-rank weight matrices throughout the DNN. This allowed us to increase representational power by increasing the number of hidden nodes…
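To make the low-rank idea from the abstract concrete, here is a minimal sketch of a rank-r factorized layer (a generic PyTorch illustration, not the authors' exact configuration; the layer sizes and rank in the usage note are assumptions for the example):

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Low-rank factorization of a dense layer: the full weight matrix
    W (out_dim x in_dim) is replaced by B (out_dim x rank) @ A (rank x in_dim),
    cutting multiplies from out_dim*in_dim to rank*(out_dim + in_dim)."""
    def __init__(self, in_dim, out_dim, rank):
        super().__init__()
        self.project = nn.Linear(in_dim, rank, bias=False)  # A: down-projection
        self.expand = nn.Linear(rank, out_dim)              # B: up-projection (+ bias)

    def forward(self, x):
        return self.expand(self.project(x))
```

With, say, in_dim = out_dim = 512 and rank = 64, the factorized layer needs about 66K parameters where a full dense layer needs about 262K, which is how hidden-layer width can grow under the same CPU budget.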

Cited by 87 publications (55 citation statements)
References 14 publications (23 reference statements)
“…We focus on improving the ASR performance under multimedia noise, which is commonly present at home. T/S learning was first explored in the speech community [22,23] to distill the knowledge from bigger models into a smaller one, and was afterwards applied successfully to ASR [24,25] and keyword spotting [26]. Instead of knowledge distillation, we adopt T/S learning for domain adaptation, as proposed in [27], to build an ASR system that performs more robustly under multimedia noise.…”
Section: Introduction (mentioning)
Confidence: 99%
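The teacher/student (T/S) objective this statement refers to can be sketched briefly. The following is a minimal version of the standard soft-target loss; the temperature value and the omission of a hard-label term are assumptions for illustration, not details from the cited work:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target loss for teacher/student learning: the student is
    trained to match the teacher's temperature-softened output
    distribution. Real systems often mix this with a hard-label loss."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 to keep gradient magnitudes
    # comparable across temperatures (a common convention)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```

In the domain-adaptation variant the quote describes, the same objective is used but the teacher is fed clean speech while the student sees a noisy parallel copy, so no hard transcriptions are required.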
“…our first stage, always-on network has about 13K parameters [8]). The SVDF also pairs very well with linear bottleneck layers to significantly reduce the parameter count, as in [18,19] and, more recently, in [10]. And because it allows for creating evenly sized deep networks, we can insert them throughout the network as in Figure 3.…”
Section: Efficient Memoized Neural Network Topology (mentioning)
Confidence: 97%
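For readers unfamiliar with the SVDF mentioned above, here is a minimal sketch of such a rank-1 memoized layer (a generic PyTorch illustration under assumed shapes, not the topology of the cited system):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDFLayer(nn.Module):
    """Rank-1 SVDF layer: each of the num_nodes units applies a feature
    filter over the input frame and a time filter over its last
    memory_size per-frame activations, approximating a full
    (input_dim x memory_size) filter with input_dim + memory_size
    parameters per node."""
    def __init__(self, input_dim, num_nodes, memory_size):
        super().__init__()
        self.feature_filters = nn.Linear(input_dim, num_nodes, bias=False)
        self.time_filters = nn.Parameter(torch.randn(num_nodes, memory_size) * 0.1)
        self.memory_size = memory_size

    def forward(self, frames):
        # frames: (batch, time, input_dim)
        feat = self.feature_filters(frames)             # (batch, time, nodes)
        # left-pad time so every frame sees memory_size past activations
        feat = F.pad(feat, (0, 0, self.memory_size - 1, 0))
        windows = feat.unfold(1, self.memory_size, 1)   # (batch, time, nodes, T)
        out = (windows * self.time_filters).sum(-1)     # (batch, time, nodes)
        return torch.relu(out)
```

Because the per-node state is just a short buffer of scalars, the layer streams naturally frame by frame, which is what makes it attractive for always-on first-stage models.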
“…Keyword (KW) spotting in continuous speech has been an area of research for more than two decades [14]-[23]. In much of the recent work, latency and computation are not concerns, and offline large-vocabulary speech recognition systems can be used to decode the audio utterances and create transcripts or lattices of words and/or phones, which can then be searched for the presence of the keyword(s) of interest [14]-[16].…”
Section: Keyword Spotting (mentioning)
Confidence: 99%
“…The background model is also sometimes called the filler or garbage model, and may be a simple speech/non-speech loop HMM [18], or may involve a loop over phones or words [21]. With the growing success of deep learning in recent years, novel techniques using Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs) that do not involve HMMs have also been proposed [22,6,23].…”
Section: Keyword Spotting (mentioning)
Confidence: 99%
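The HMM-free DNN approach this last statement alludes to typically scores a keyword by smoothing per-frame posteriors and combining per-unit maxima over a sliding window. The sketch below follows that general recipe; the window sizes and the exact scoring rule are assumptions for illustration, not the method of any one cited paper:

```python
import numpy as np

def keyword_score(posteriors, smooth_win=30, search_win=100):
    """Toy confidence score for an HMM-free DNN keyword spotter.
    posteriors: (num_frames, num_keyword_units) per-frame posteriors
    for the keyword's sub-units (the filler/background unit excluded)."""
    # moving-average smoothing of each unit's posterior trajectory
    kernel = np.ones(smooth_win) / smooth_win
    smoothed = np.apply_along_axis(
        lambda p: np.convolve(p, kernel, mode="same"), 0, posteriors)
    # combine each unit's best smoothed posterior in the search window
    # via a geometric mean; a detection fires when this exceeds a threshold
    window = smoothed[-search_win:]
    return float(np.exp(np.mean(np.log(window.max(axis=0) + 1e-10))))
```

Unlike the filler-loop HMM decoders described in the quote, this style of detector needs no decoding graph at runtime, which is part of its appeal for small-footprint devices.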