2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019
DOI: 10.1109/asru46091.2019.9004014

Query-by-Example On-Device Keyword Spotting

Abstract: A keyword spotting (KWS) system detects the presence of a keyword, usually predefined, in a continuous speech stream. This paper presents a query-by-example on-device KWS system that is user-specific. The proposed system consists of two main steps: query enrollment and testing. In the query enrollment step, phonetic posteriors are output by a small-footprint automatic speech recognition model based on connectionist temporal classification. Using the phonetic-level posteriorgram, a hypothesis graph of finite-state …
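The enrollment step described above amounts to running each spoken query through a small CTC acoustic model and keeping the per-frame phone posteriors. The sketch below is only a minimal illustration of that idea: the model architecture, phone-set size, and feature dimensions are assumptions rather than details from the paper, and the paper's subsequent hypothesis-graph construction over the posteriorgram is not reproduced here.

```python
import torch

# Hypothetical small-footprint CTC acoustic model: maps log-mel frames to
# per-frame posteriors over a phone set plus the CTC blank symbol.
class SmallCTCAcousticModel(torch.nn.Module):
    def __init__(self, n_mels=40, hidden=128, n_phones=46):
        super().__init__()
        self.rnn = torch.nn.GRU(n_mels, hidden, num_layers=2, batch_first=True)
        self.proj = torch.nn.Linear(hidden, n_phones + 1)  # +1 for the CTC blank

    def forward(self, feats):  # feats: (batch, frames, n_mels)
        h, _ = self.rnn(feats)
        return torch.log_softmax(self.proj(h), dim=-1)

def phonetic_posteriorgram(model, feats):
    """Return a (frames x phones) posteriorgram for one enrollment utterance."""
    with torch.no_grad():
        return model(feats.unsqueeze(0)).squeeze(0).exp()

# Usage (illustrative): enroll a user-specific query from a recording.
model = SmallCTCAcousticModel()
query_feats = torch.randn(120, 40)   # stand-in for real log-mel features
enrolled = phonetic_posteriorgram(model, query_feats)
print(enrolled.shape)                # torch.Size([120, 47])
```

At test time, the incoming stream would be converted to posteriorgrams in the same way and scored against the enrolled query; the paper does this via a finite-state hypothesis graph rather than the direct comparison suggested here.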

Cited by 20 publications (7 citation statements)
References 26 publications
“…Then, we follow the training details of CIFAR experiments in [17]. Keyword Spotting: For keyword spotting, we use the Qualcomm Keyword Speech Dataset [20] and ResNet-26 [21] for the baseline network. The Qualcomm keyword dataset contains 4,270 utterances of four English keywords spoken by 42-50 people.…”
Section: Methods (mentioning, confidence: 99%)
“…For example, this is the case for the speech corpora reported in [26], [122] and [9], which were collected, respectively, from Mobvoi's TicKasa Fox, Google's Google Home and Xiaomi's AI Speaker smart speakers. Unfortunately, only seven out of twenty-six datasets in Table 1 are publicly available: one from Sonos [169], two different arrangements of AISHELL-2 [199] (used in [98]), the Google Speech Commands Dataset v1 [153] and v2 [154], the Hey Snapdragon Keyword Dataset [200], and Hey Snips [78], [198] (also used in, e.g., [53], [177]). Readers interested in accessing any of these speech corpora are pointed to the corresponding references indicated in Table 1.…”
Section: Datasets (mentioning, confidence: 99%)
“…This is, for example, the case for voice activation of voice assistants, where privacy is a major concern [232] since this application involves streaming voice to a cloud server. As a result, a popular variant of the ROC and DET curves is the one replacing false positive rate along the x-axis with the number of false alarms per hour [8], [28], [31], [59], [60], [156], [162], [200]. By this, a practitioner can simply set a very small number of false alarms per hour (e.g., 1) and identify the system with the highest true positive (lowest false negative) rate for deployment.…”
Section: B. Receiver Operating Characteristic and Detection Error Trade-off Curves (mentioning, confidence: 99%)
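The false-alarms-per-hour convention referenced in this excerpt is easy to make concrete. The following sketch is a minimal illustration; the function names, score lists, and the 1 FA/h budget are assumptions rather than details from the cited works. It converts a raw false-positive count into false alarms per hour and sweeps a detection threshold to find the lowest false-reject rate that stays within a given FA/h budget.

```python
def false_alarms_per_hour(n_false_alarms, negative_audio_seconds):
    """Convert a false-positive count on negative audio into false alarms per hour."""
    return n_false_alarms / (negative_audio_seconds / 3600.0)

def pick_operating_point(scores_neg, scores_pos, negative_audio_seconds,
                         max_fa_per_hour=1.0):
    """Sweep detection-score thresholds; return (threshold, false-reject rate)
    with the lowest false-reject rate while keeping at most max_fa_per_hour
    false alarms per hour on the negative audio."""
    best = None
    for thr in sorted(set(scores_neg) | set(scores_pos)):
        fa = sum(s >= thr for s in scores_neg)
        if false_alarms_per_hour(fa, negative_audio_seconds) <= max_fa_per_hour:
            frr = sum(s < thr for s in scores_pos) / len(scores_pos)
            if best is None or frr < best[1]:
                best = (thr, frr)
    return best

# Example: 3 false alarms over 6 hours of negative audio -> 0.5 FA/h.
print(false_alarms_per_hour(3, 6 * 3600))
```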
“…MHAtt-RNN [20] utilizes multi-head attention over it. On the other hand, there are automatic speech recognition based approaches [21, 22]. Note that these approaches are successful but are typically not efficient in terms of the number of parameters compared to CNN-based approaches.…”
Section: Related Work (mentioning, confidence: 99%)