Interspeech 2019
DOI: 10.21437/interspeech.2019-1676

A Time Delay Neural Network with Shared Weight Self-Attention for Small-Footprint Keyword Spotting

Abstract: Keyword spotting requires a small memory footprint to run on mobile devices. However, previous works still use several hundred thousand parameters to achieve good performance. To address this issue, we propose a time delay neural network with shared weight self-attention for small-footprint keyword spotting. By sharing weights, the number of self-attention parameters is reduced without any loss in performance. The publicly available Google Speech Commands dataset is used to evaluate the models. The number of param…
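The abstract does not spell out the layer, but the core idea it names, sharing one projection matrix across query, key, and value, can be sketched in a few lines. The NumPy sketch below is an illustration under that assumption (the shapes, the 64-dim TDNN features, and the initialization are all hypothetical), not the paper's implementation: a standard self-attention layer learns three projection matrices, so sharing a single one cuts those parameters to a third.

```python
import numpy as np

def shared_weight_self_attention(x, w):
    """Scaled dot-product self-attention with one projection shared by Q, K, V.

    x: (T, d) sequence of frame features; w: (d, d_k) shared projection.
    A standard layer uses separate W_q, W_k, W_v; sharing one matrix
    reduces the projection parameter count to a third.
    """
    q = k = v = x @ w                             # (T, d_k), single projection
    scores = q @ k.T / np.sqrt(w.shape[1])        # (T, T) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
    return attn @ v                               # (T, d_k) context vectors

# Toy usage: 50 frames of 64-dim TDNN features, projected to 32 dims.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 64))
w = 0.1 * rng.standard_normal((64, 32))
print(shared_weight_self_attention(x, w).shape)  # (50, 32)
```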

Cited by 21 publications (24 citation statements)
References 14 publications
“…As the performance measure, the equal error rate (EER) and the false rejection rate (FRR) at a 1.0% false alarm rate (FAR) were used, which have been widely used in the literature, including [4][5][6][10][11][12]. The FRR and FAR represented the probability of falsely rejecting the WUW inputs and falsely accepting the non-WUW inputs, respectively.…”
Section: Results
confidence: 99%
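For concreteness, both operating points can be computed directly from raw detector scores. This NumPy sketch is illustrative only (it is not code from any of the cited papers) and assumes the usual convention that higher scores mean "more keyword-like":

```python
import numpy as np

def frr_at_far(pos_scores, neg_scores, target_far=0.01):
    """FRR at the decision threshold that yields a fixed FAR.

    pos_scores: detector scores for true wake-up-word (WUW) inputs;
    neg_scores: scores for non-WUW inputs.
    """
    # Threshold above which `target_far` of the negatives still fall.
    thr = np.quantile(neg_scores, 1.0 - target_far)
    return float(np.mean(pos_scores < thr))

def equal_error_rate(pos_scores, neg_scores):
    """EER: the operating point where FRR and FAR coincide (grid over scores)."""
    best_gap, best_rate = np.inf, None
    for thr in np.sort(np.concatenate([pos_scores, neg_scores])):
        frr = np.mean(pos_scores < thr)   # falsely rejected WUW inputs
        far = np.mean(neg_scores >= thr)  # falsely accepted non-WUW inputs
        if abs(frr - far) < best_gap:
            best_gap, best_rate = abs(frr - far), (frr + far) / 2
    return float(best_rate)
```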
“…In earlier research, a support vector machine (SVM) was used for the WUW recognition system [1]. Because deep neural network (DNN) systems have proven highly effective in many fields, there have been numerous efforts to build DNN-based WUW recognizers in various ways [2][3][4][5][6][7][8][9][10][11][12]. In [2], a bidirectional long short-term memory (BLSTM)-based end-to-end model was used to calculate the posterior probability, similar to a hybrid system, and weighted finite-state transducers (WFSTs) were used to generate a confidence score from the calculated posterior probability.…”
Section: Introduction
confidence: 99%
“…In this paradigm (new at the time), the sequence of word posterior probabilities yielded by a DNN is directly processed to determine the possible presence of keywords without the intervention of any HMM (see Figure 2). The deep KWS paradigm has recently attracted much attention [16], [26] for three reasons:
Section: Introduction
confidence: 99%
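The "direct processing" of posteriors that the quote describes is typically posterior smoothing followed by a windowed confidence score. The sketch below follows that common deep KWS recipe under assumed hyperparameters (30-frame smoothing, 100-frame window); individual papers differ in the details, so treat it as a schematic rather than any cited system:

```python
import numpy as np

def keyword_confidence(posteriors, smooth=30, window=100):
    """Keyword confidence from frame-level word posteriors, HMM-free.

    posteriors: (T, n) DNN outputs, one column per word of the keyword.
    Each frame's posterior is averaged over the previous `smooth` frames;
    the confidence at frame t is then the geometric mean of each word's
    maximum smoothed posterior inside the trailing `window` frames.
    """
    T, n = posteriors.shape
    smoothed = np.stack([posteriors[max(0, t - smooth + 1):t + 1].mean(axis=0)
                         for t in range(T)])
    conf = np.empty(T)
    for t in range(T):
        span = smoothed[max(0, t - window + 1):t + 1]
        conf[t] = np.prod(span.max(axis=0)) ** (1.0 / n)
    return conf  # fire the keyword when conf crosses a tuned threshold
```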
“…The computations are more parallelized, in the sense that the processing of a frame does not depend on the completion of processing other frames in the same layer. [15] also explored self-attention in the keyword search (KWS) task. However, the original self-attention requires the entire input sequence to be available before any frames can be processed, and the computational complexity and memory usage are both O(T²).…”
Section: Introduction
confidence: 99%
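The quadratic term the quote refers to comes from materializing the full (T, T) attention score matrix, which also explains why the whole sequence must be available first: every softmax row needs scores against all T frames. A tiny NumPy demonstration with arbitrary sizes:

```python
import numpy as np

# Doubling the sequence length T quadruples both the multiply-adds
# (T * T * d) and the memory for the (T, T) score matrix.
d = 32
for T in (100, 200, 400):
    q = k = np.ones((T, d))
    scores = q @ k.T  # (T, T) attention scores over the full sequence
    print(f"T={T}: scores {scores.shape}, {scores.nbytes / 1e6:.2f} MB")
```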