Interspeech 2018
DOI: 10.21437/interspeech.2018-1979

Efficient Keyword Spotting Using Time Delay Neural Networks

Abstract: This paper describes a novel method of live keyword spotting using a two-stage time delay neural network. The model is trained using transfer learning: initial training with phone targets from a large speech corpus is followed by training with keyword targets from a smaller data set. The accuracy of the system is evaluated on two separate tasks. The first is the freely available Google Speech Commands dataset. The second is an in-house task specifically developed for keyword spotting. The results show signific…
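As a rough illustration of the two-stage transfer-learning recipe the abstract describes, the sketch below trains a small TDNN on phone targets and then swaps the output layer for keyword targets. It is a minimal sketch, assuming PyTorch and hypothetical layer sizes, feature dimensions, and target inventories; the paper's exact topology is not reproduced here.

```python
# Minimal sketch of a two-stage TDNN keyword spotter (phone targets,
# then keyword targets). All sizes here are illustrative assumptions.
import torch.nn as nn

class TDNN(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_out=42):
        super().__init__()
        # Dilated 1-D convolutions over time act as time-delay layers,
        # widening the temporal context at deeper layers.
        self.body = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=4), nn.ReLU(),
        )
        self.head = nn.Conv1d(hidden, n_out, kernel_size=1)  # per-frame logits

    def forward(self, x):  # x: (batch, feat_dim, frames)
        return self.head(self.body(x))

# Stage 1: train on phone targets from a large corpus (42 phones assumed).
model = TDNN(n_out=42)
# ... frame-level cross-entropy training on the phone-labelled corpus ...

# Stage 2: keep the trained body, replace the head with keyword targets
# (here 10 keywords + 1 filler class, assumed), fine-tune on the small set.
model.head = nn.Conv1d(128, 10 + 1, kernel_size=1)
# ... fine-tune with keyword/filler labels ...
```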

Cited by 28 publications (34 citation statements)
References 9 publications
“…The majority of cited works use MFCC or log Mel-filterbank features. In this area, we see a reduction of inductive bias over time: more and more recent papers, such as (Raziel and Hyun-Jin, 2018) or (Myer and Tomar, 2018), do not use the DCT step, probably because deep neural networks work reasonably well even with correlated features. We expect further simplification: using the raw waveform or some unsupervised approach like contrastive predictive coding (Oord et al., 2018).…”
Section: Results
confidence: 93%
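The relationship this quote alludes to can be made concrete: MFCCs are log Mel-filterbank energies followed by a DCT, so dropping the DCT step simply means feeding the correlated log-mel features to the network directly. A minimal sketch, assuming torchaudio and illustrative front-end settings (16 kHz, 40 mel bands, 13 cepstra):

```python
# Log Mel-filterbank features vs. MFCCs: the only difference is the
# final DCT. All settings here are illustrative assumptions.
import torch
import torchaudio

wave = torch.randn(1, 16000)  # one second of placeholder 16 kHz audio

melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40)
logmel = torch.log(melspec(wave) + 1e-6)   # (1, 40, frames), no DCT

mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000, n_mfcc=13,
    melkwargs=dict(n_fft=400, hop_length=160, n_mels=40))(wave)  # DCT applied
```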
“…Whole word (Morgan et al., 1990; Rose and Paul, 1990; Naylor et al., 1992; Rohlicek et al., 1993; Cuayáhuitl and Serridge, 2002; Baljekar et al., 2014; Chen et al., 2014a; Zehetner et al., 2014; Hou et al., 2016; Manor and Greenberg, 2017; Fernández-Marqués et al., 2018; Myer and Tomar, 2018)
Monophone (Rose and Paul, 1990; Rohlicek et al., 1993; Cuayáhuitl and Serridge, 2002; Heracleous and Shimizu, 2003; Szöke et al., 2005; Lehtonen, 2005; Silaghi and Vargiya, 2005; Wöllmer et al., 2009b; Jansen and Niyogi, 2009a,c; Wöllmer et al., 2009a; Szöke et al., 2010; Shokri et al., 2011; Tabibian et al., 2011; Hou et al., 2016; Kumatani et al., 2017; Gruenstein et al., 2017; Tabibian et al., 2018; Myer and Tomar, 2018)
Triphone (Rose and Paul, 1990; Szöke et al., 2005)
Part of the word (Naylor et al., 1992; Li and Wang, 2014; Chen et al., 2014a)
State unit (Zeppenfeld and Waibel, 1992)
Part of the phoneme (Rohlicek et al., 1989; Kosonocky and Mammone, 1995; Leow et al., 2012)
Syllable (Klemm et al., 1995;…”
Section: Acoustic Unit Sources
confidence: 99%
“…When the confidence score exceeds a threshold, the keyword is detected. This approach offers a small footprint, low computational cost, low latency, and high performance, and has drawn much attention recently [4, 5, 6]. However, previous work still uses several hundred thousand parameters to achieve state-of-the-art performance.…”
Section: Introduction
confidence: 99%
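The detection rule described in this quote (smoothed score vs. threshold) can be sketched in a few lines. This is a generic illustration, assuming NumPy, per-frame keyword posteriors from the network, and an arbitrary smoothing window and threshold; none of these values come from the cited papers.

```python
# Sketch of posterior smoothing + thresholding for keyword detection.
# Window length and threshold are illustrative assumptions.
import numpy as np

def detect_keyword(posteriors, window=30, threshold=0.8):
    """posteriors: (frames,) per-frame keyword probabilities.
    Returns the first frame whose smoothed score exceeds the threshold,
    or None if the keyword is never detected."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(posteriors, kernel, mode="same")  # moving average
    hits = np.nonzero(smoothed > threshold)[0]
    return int(hits[0]) if hits.size else None

# Example: a burst of high posteriors around frame ~100 fires a detection.
scores = np.zeros(200)
scores[90:130] = 0.95
print(detect_keyword(scores))
```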
“…However, because of the number of hidden layers and filters, their best model still has more than 200K parameters. A stacked time delay neural network (TDNN) based model with transfer learning was proposed in [5], but the stacked network architecture makes the model size large.…”
Section: Introduction
confidence: 99%