DONUT: CTC-based Query-by-Example Keyword Spotting
Preprint | 2018 | DOI: 10.48550/arxiv.1811.10736

Cited by 5 publications (6 citation statements) | References 0 publications
“…Similarly, [8] used a seq2seq GRU autoencoder to encode the audio, and a CNN-RNN language model to encode the characters. In addition, [9] used Connectionist Temporal Classification beam search to produce a hypothesis set based on the posteriors of ASR. [10] used an RNN transducer model for predicting subword units.…”
Section: *Authors Contributed Equally | mentioning | confidence: 99%
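The hypothesis-set idea referenced for [9] amounts to decoding the acoustic model's CTC posteriors into several candidate label sequences. The sketch below is a minimal NumPy prefix beam search over a (frames × labels) matrix of log posteriors; the beam size, the choice of index 0 as the blank symbol, and the random toy input are assumptions for illustration, not details taken from the cited systems.

```python
import numpy as np
from collections import defaultdict

def ctc_beam_search(log_probs, beam_size=8, blank=0):
    """Prefix beam search over CTC log posteriors of shape (frames, labels).

    Returns a small hypothesis set: the top `beam_size` label sequences
    (tuples of label indices) with their total log probabilities.
    """
    T, V = log_probs.shape
    # Each prefix keeps (log prob ending in blank, log prob ending in non-blank).
    beams = {(): (0.0, -np.inf)}
    for t in range(T):
        next_beams = defaultdict(lambda: (-np.inf, -np.inf))
        for prefix, (p_b, p_nb) in beams.items():
            for s in range(V):
                p = log_probs[t, s]
                if s == blank:
                    # Emitting blank keeps the prefix unchanged.
                    b, nb = next_beams[prefix]
                    next_beams[prefix] = (np.logaddexp(b, np.logaddexp(p_b, p_nb) + p), nb)
                elif prefix and prefix[-1] == s:
                    # Repeated label: extending the prefix needs a blank in between ...
                    b, nb = next_beams[prefix + (s,)]
                    next_beams[prefix + (s,)] = (b, np.logaddexp(nb, p_b + p))
                    # ... otherwise the repeat collapses onto the same prefix.
                    b, nb = next_beams[prefix]
                    next_beams[prefix] = (b, np.logaddexp(nb, p_nb + p))
                else:
                    b, nb = next_beams[prefix + (s,)]
                    next_beams[prefix + (s,)] = (b, np.logaddexp(nb, np.logaddexp(p_b, p_nb) + p))
        # Prune to the most probable prefixes.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: np.logaddexp(*kv[1]),
                            reverse=True)[:beam_size])
    return [(prefix, np.logaddexp(b, nb)) for prefix, (b, nb) in beams.items()]

# Toy usage: 5 frames over 4 labels (index 0 is the blank).
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
for labels, score in ctc_beam_search(log_probs, beam_size=4):
    print(labels, round(float(score), 3))
```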
“…KWS Models: Keyword spotting is generally formulated as a task of classifying fixed length speech segments into a known vocabulary of keywords. Modern KWS models [1,4,11,12,13,14,15,16] are often based on neural network classifiers with audio features like Mel-frequency Cepstral Coefficients, extracted from the speech signal as input. Sainath and Parada [1] were among the first to explore CNNs for KWS under memory and compute constrained settings.…”
Section: Related Work | mentioning | confidence: 99%
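As a concrete reading of this formulation, the sketch below classifies a fixed-length MFCC patch into a small keyword vocabulary with a tiny CNN in PyTorch. The layer sizes, the 40 × 98 input shape, and the 12-keyword output are illustrative assumptions and do not reproduce any of the cited architectures [1,4,11-16].

```python
import torch
import torch.nn as nn

class SmallKWSNet(nn.Module):
    """Tiny CNN classifier over a fixed-length MFCC patch (coeffs x frames)."""
    def __init__(self, n_keywords=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),     # pool over the remaining time/freq grid
        )
        self.classifier = nn.Linear(32, n_keywords)

    def forward(self, x):                # x: (batch, 1, n_mfcc, n_frames)
        h = self.features(x).flatten(1)  # (batch, 32)
        return self.classifier(h)        # unnormalised keyword scores

# One forward pass on a dummy segment of 40 MFCCs x 98 frames.
model = SmallKWSNet()
scores = model(torch.randn(1, 1, 40, 98))
print(scores.shape)  # torch.Size([1, 12])
```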
“…When the neural network is very small, it tends to make phoneme or grapheme prediction errors. The systems can take advantage of the network predicting phonemes or graphemes by augmenting the keyword set with alternative pronunciations, either estimated from the training set [27] or from examples spoken by the user [28]. Using the knowledge about the confusions of the network along with the peaky behaviour of CTC-trained networks, an efficient detection can be implemented based on a minimum edit distance search of the keywords in a compact phone lattice [29].…”
Section: Introduction | mentioning | confidence: 99%
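For illustration only, the minimum-edit-distance idea can be approximated by sliding a keyword's phone sequence over a CTC-decoded phone string and firing when the Levenshtein distance drops below a threshold. This windowed search is a crude stand-in for the compact phone lattice of [29]; the phone labels and the distance threshold below are assumed for the example.

```python
import numpy as np

def edit_distance(a, b):
    """Standard Levenshtein distance between two label sequences."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[len(a), len(b)]

def detect_keyword(decoded_phones, keyword_phones, max_dist=1):
    """Slide the keyword over the decoded phone string and fire when the
    minimum edit distance drops to `max_dist` or below."""
    n = len(keyword_phones)
    best = min(
        (edit_distance(decoded_phones[i:i + n], keyword_phones)
         for i in range(max(1, len(decoded_phones) - n + 1))),
        default=n,
    )
    return best <= max_dist, best

# Example: the network confused "aa" with "ah", but the keyword still matches.
hit, dist = detect_keyword(["hh", "ah", "l", "ow"], ["hh", "aa", "l", "ow"])
print(hit, dist)  # True 1
```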