Interspeech 2018
DOI: 10.21437/interspeech.2018-1979

Efficient Keyword Spotting Using Time Delay Neural Networks

Abstract: This paper describes a novel method of live keyword spotting using a two-stage time delay neural network. The model is trained using transfer learning: initial training with phone targets from a large speech corpus is followed by training with keyword targets from a smaller data set. The accuracy of the system is evaluated on two separate tasks. The first is the freely available Google Speech Commands dataset. The second is an in-house task specifically developed for keyword spotting. The results show signific…
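As a rough illustration of the two-stage transfer-learning recipe the abstract describes, the sketch below trains a small TDNN on phone targets and then swaps the output layer for keyword targets. It is a minimal sketch, assuming PyTorch and hypothetical layer sizes, feature dimensions, and target inventories; the paper's exact topology is not reproduced here.

```python
# Minimal sketch of a two-stage TDNN keyword spotter (phone targets,
# then keyword targets). All sizes here are illustrative assumptions.
import torch.nn as nn

class TDNN(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_out=42):
        super().__init__()
        # Dilated 1-D convolutions over time act as time-delay layers,
        # widening the temporal context at deeper layers.
        self.body = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=4), nn.ReLU(),
        )
        self.head = nn.Conv1d(hidden, n_out, kernel_size=1)  # per-frame logits

    def forward(self, x):  # x: (batch, feat_dim, frames)
        return self.head(self.body(x))

# Stage 1: train on phone targets from a large corpus (42 phones assumed).
model = TDNN(n_out=42)
# ... frame-level cross-entropy training on the phone-labelled corpus ...

# Stage 2: keep the trained body, replace the head with keyword targets
# (here 10 keywords + 1 filler class, assumed), fine-tune on the small set.
model.head = nn.Conv1d(128, 10 + 1, kernel_size=1)
# ... fine-tune with keyword/filler labels ...
```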

Cited by 28 publications (34 citation statements)
References 9 publications
“…The majority of cited works use MFCC or log Mel-filterbank features. In this area, we see a reduction of inductive bias over time: more and more recent papers, such as (Raziel and Hyun-Jin, 2018) or (Myer and Tomar, 2018), do not use the DCT step, probably because deep neural networks work reasonably well even with correlated features. We expect further simplification: using the raw waveform or some unsupervised approach like contrastive predictive coding (Oord et al., 2018).…”
Section: Results
confidence: 93%
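The relationship this quote alludes to can be made concrete: MFCCs are log Mel-filterbank energies followed by a DCT, so dropping the DCT step simply means feeding the correlated log-mel features to the network directly. A minimal sketch, assuming torchaudio and illustrative front-end settings (16 kHz, 40 mel bands, 13 cepstra):

```python
# Log Mel-filterbank features vs. MFCCs: the only difference is the
# final DCT. All settings here are illustrative assumptions.
import torch
import torchaudio

wave = torch.randn(1, 16000)  # one second of placeholder 16 kHz audio

melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40)
logmel = torch.log(melspec(wave) + 1e-6)   # (1, 40, frames), no DCT

mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000, n_mfcc=13,
    melkwargs=dict(n_fft=400, hop_length=160, n_mels=40))(wave)  # DCT applied
```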
“…Whole word (Morgan et al., 1990; Rose and Paul, 1990; Naylor et al., 1992; Rohlicek et al., 1993; Cuayáhuitl and Serridge, 2002; Baljekar et al., 2014; Chen et al., 2014a; Zehetner et al., 2014; Hou et al., 2016; Manor and Greenberg, 2017; Fernández-Marqués et al., 2018; Myer and Tomar, 2018)
Monophone (Rose and Paul, 1990; Rohlicek et al., 1993; Cuayáhuitl and Serridge, 2002; Heracleous and Shimizu, 2003; Szöke et al., 2005; Lehtonen, 2005; Silaghi and Vargiya, 2005; Wöllmer et al., 2009b; Jansen and Niyogi, 2009a,c; Wöllmer et al., 2009a; Szöke et al., 2010; Shokri et al., 2011; Tabibian et al., 2011; Hou et al., 2016; Kumatani et al., 2017; Gruenstein et al., 2017; Tabibian et al., 2018; Myer and Tomar, 2018)
Triphone (Rose and Paul, 1990; Szöke et al., 2005)
Part of the word (Naylor et al., 1992; Li and Wang, 2014; Chen et al., 2014a)
State unit (Zeppenfeld and Waibel, 1992)
Part of the phoneme (Rohlicek et al., 1989; Kosonocky and Mammone, 1995; Leow et al., 2012)
Syllable (Klemm et al., 1995;…”
Section: Acoustic Unit Sources
confidence: 99%
“…When the confidence score exceeds a threshold, the keyword is detected. This approach offers a small footprint, low computational cost, low latency, and high performance, and has drawn much attention recently [4, 5, 6]. However, previous work still uses several hundred thousand parameters to achieve state-of-the-art performance.…”
Section: Introduction
confidence: 99%
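The detection rule described in this quote (smoothed score vs. threshold) can be sketched in a few lines. This is a generic illustration, assuming NumPy, per-frame keyword posteriors from the network, and an arbitrary smoothing window and threshold; none of these values come from the cited papers.

```python
# Sketch of posterior smoothing + thresholding for keyword detection.
# Window length and threshold are illustrative assumptions.
import numpy as np

def detect_keyword(posteriors, window=30, threshold=0.8):
    """posteriors: (frames,) per-frame keyword probabilities.
    Returns the first frame whose smoothed score exceeds the threshold,
    or None if the keyword is never detected."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(posteriors, kernel, mode="same")  # moving average
    hits = np.nonzero(smoothed > threshold)[0]
    return int(hits[0]) if hits.size else None

# Example: a burst of high posteriors around frame ~100 fires a detection.
scores = np.zeros(200)
scores[90:130] = 0.95
print(detect_keyword(scores))
```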
“…However, because of the number of hidden layers and filters, their best model still has more than 200K parameters. A stacked time delay neural network (TDNN) based model with transfer learning was proposed in [5], but the stacked network architecture makes the model size large.…”
Section: Introduction
confidence: 99%