DONUT: CTC-based Query-by-Example Keyword Spotting
Preprint | 2018 | DOI: 10.48550/arxiv.1811.10736

Cited by 5 publications (6 citation statements) | References 0 publications
“…Similarly, [8] used a seq2seq GRU autoencoder to encode the audio, and a CNN-RNN language model to encode the characters. In addition, [9] used Connectionist Temporal Classification beam search to produce a hypothesis set based on the posteriors of ASR. [10] used an RNN transducer model for predicting subword units.…”
Section: *Authors Contributed Equally | mentioning | confidence: 99%
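The hypothesis-set idea referenced for [9] amounts to decoding the acoustic model's CTC posteriors into several candidate label sequences. The sketch below is a minimal NumPy prefix beam search over a (frames × labels) matrix of log posteriors; the beam size, the choice of index 0 as the blank symbol, and the random toy input are assumptions for illustration, not details taken from the cited systems.

```python
import numpy as np
from collections import defaultdict

def ctc_beam_search(log_probs, beam_size=8, blank=0):
    """Prefix beam search over CTC log posteriors of shape (frames, labels).

    Returns a small hypothesis set: the top `beam_size` label sequences
    (tuples of label indices) with their total log probabilities.
    """
    T, V = log_probs.shape
    # Each prefix keeps (log prob ending in blank, log prob ending in non-blank).
    beams = {(): (0.0, -np.inf)}
    for t in range(T):
        next_beams = defaultdict(lambda: (-np.inf, -np.inf))
        for prefix, (p_b, p_nb) in beams.items():
            for s in range(V):
                p = log_probs[t, s]
                if s == blank:
                    # Emitting blank keeps the prefix unchanged.
                    b, nb = next_beams[prefix]
                    next_beams[prefix] = (np.logaddexp(b, np.logaddexp(p_b, p_nb) + p), nb)
                elif prefix and prefix[-1] == s:
                    # Repeated label: extending the prefix needs a blank in between ...
                    b, nb = next_beams[prefix + (s,)]
                    next_beams[prefix + (s,)] = (b, np.logaddexp(nb, p_b + p))
                    # ... otherwise the repeat collapses onto the same prefix.
                    b, nb = next_beams[prefix]
                    next_beams[prefix] = (b, np.logaddexp(nb, p_nb + p))
                else:
                    b, nb = next_beams[prefix + (s,)]
                    next_beams[prefix + (s,)] = (b, np.logaddexp(nb, np.logaddexp(p_b, p_nb) + p))
        # Prune to the most probable prefixes.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: np.logaddexp(*kv[1]),
                            reverse=True)[:beam_size])
    return [(prefix, np.logaddexp(b, nb)) for prefix, (b, nb) in beams.items()]

# Toy usage: 5 frames over 4 labels (index 0 is the blank).
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
for labels, score in ctc_beam_search(log_probs, beam_size=4):
    print(labels, round(float(score), 3))
```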
“…KWS Models: Keyword spotting is generally formulated as a task of classifying fixed length speech segments into a known vocabulary of keywords. Modern KWS models [1,4,11,12,13,14,15,16] are often based on neural network classifiers with audio features like Mel-frequency Cepstral Coefficients, extracted from the speech signal as input. Sainath and Parada [1] were among the first to explore CNNs for KWS under memory and compute constrained settings.…”
Section: Related Work | mentioning | confidence: 99%
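As a concrete reading of this formulation, the sketch below classifies a fixed-length MFCC patch into a small keyword vocabulary with a tiny CNN in PyTorch. The layer sizes, the 40 × 98 input shape, and the 12-keyword output are illustrative assumptions and do not reproduce any of the cited architectures [1,4,11-16].

```python
import torch
import torch.nn as nn

class SmallKWSNet(nn.Module):
    """Tiny CNN classifier over a fixed-length MFCC patch (coeffs x frames)."""
    def __init__(self, n_keywords=12):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),     # pool over the remaining time/freq grid
        )
        self.classifier = nn.Linear(32, n_keywords)

    def forward(self, x):                # x: (batch, 1, n_mfcc, n_frames)
        h = self.features(x).flatten(1)  # (batch, 32)
        return self.classifier(h)        # unnormalised keyword scores

# One forward pass on a dummy segment of 40 MFCCs x 98 frames.
model = SmallKWSNet()
scores = model(torch.randn(1, 1, 40, 98))
print(scores.shape)  # torch.Size([1, 12])
```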
“…When the neural network is very small, it tends to make phoneme or grapheme prediction errors. The systems can take advantage of the network predicting phonemes or graphemes by augmenting the keyword set with alternative pronunciations, either estimated from the training set [27] or from examples spoken by the user [28]. Using the knowledge about the confusions of the network along with the peaky behaviour of CTC-trained networks, an efficient detection can be implemented based on a minimum edit distance search of the keywords in a compact phone lattice [29].…”
Section: Introduction | mentioning | confidence: 99%
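For illustration only, the minimum-edit-distance idea can be approximated by sliding a keyword's phone sequence over a CTC-decoded phone string and firing when the Levenshtein distance drops below a threshold. This windowed search is a crude stand-in for the compact phone lattice of [29]; the phone labels and the distance threshold below are assumed for the example.

```python
import numpy as np

def edit_distance(a, b):
    """Standard Levenshtein distance between two label sequences."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[len(a), len(b)]

def detect_keyword(decoded_phones, keyword_phones, max_dist=1):
    """Slide the keyword over the decoded phone string and fire when the
    minimum edit distance drops to `max_dist` or below."""
    n = len(keyword_phones)
    best = min(
        (edit_distance(decoded_phones[i:i + n], keyword_phones)
         for i in range(max(1, len(decoded_phones) - n + 1))),
        default=n,
    )
    return best <= max_dist, best

# Example: the network confused "aa" with "ah", but the keyword still matches.
hit, dist = detect_keyword(["hh", "ah", "l", "ow"], ["hh", "aa", "l", "ow"])
print(hit, dist)  # True 1
```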