ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053193

Training Keyword Spotters with Limited and Synthesized Speech Data

Abstract: With the rise of low power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts in the model creation process is obtaining a sufficient amount of training data. In this paper, we explore the effectiveness of synthesized speech data in training small, spoken term detection models of around 400k parameters. Instead of training such models directly on the audio or low level features…

Citations: cited by 38 publications (44 citation statements)
References: 17 publications
“…Regardless, the application of synthetic data in training low-resource keyword spotter systems has shown promise in recent experiments. Specifically, it was demonstrated that by utilizing a pre-trained speech-embedding model with approximately 400K parameters and weights initialized using human audio data, subsequent training on approximately 2000 synthetic voice examples produced a model with performance only slightly worse than the same model trained with the same number of organic audio examples [4]. The experiments in this paper build on these past works by applying the idea to a new model architecture and training environment.…”
Section: Related Work (mentioning)
confidence: 90%
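As a rough illustration of the setup described in this citation statement, the sketch below trains a small keyword-classification head on top of a frozen, pretrained speech-embedding network using only synthetic examples. PyTorch, the toy embedding architecture, its 96-dimensional output, the keyword count, and the random stand-in tensors are all assumptions made for illustration; the cited work's actual ~400K-parameter embedding model and data pipeline are not reproduced here.

import torch
import torch.nn as nn

# Placeholder for a pretrained speech-embedding network (kept frozen).
# In the cited setup this role is played by a ~400K-parameter model
# whose weights were learned from human speech.
class PretrainedEmbedding(nn.Module):
    def __init__(self, n_mels=40, embed_dim=96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):           # x: (batch, n_mels, frames)
        return self.net(x)          # (batch, embed_dim)

# Small keyword-classification head, the only part trained on synthetic data.
class KeywordHead(nn.Module):
    def __init__(self, embed_dim=96, num_keywords=10):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(embed_dim, 64), nn.ReLU(),
            nn.Linear(64, num_keywords),
        )

    def forward(self, e):
        return self.fc(e)

embedder = PretrainedEmbedding()
for p in embedder.parameters():     # freeze the pretrained weights
    p.requires_grad = False

head = KeywordHead()
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy stand-ins for roughly 2000 synthetic (features, label) examples.
feats = torch.randn(2000, 40, 100)
labels = torch.randint(0, 10, (2000,))

for epoch in range(5):
    for i in range(0, len(feats), 32):
        x, y = feats[i:i + 32], labels[i:i + 32]
        with torch.no_grad():
            e = embedder(x)         # embeddings come from the frozen network
        loss = loss_fn(head(e), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

The design choice reflected here is that only the head is trained, so the few thousand synthetic examples mentioned in the quotation only have to fit a small number of parameters rather than the full network.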
“…One of the most important steps in ensuring a frictionless experience for customers of voice assistants is "waking" the device up for interaction when the customer intends to use the device, and, importantly, not "waking" when the customer does not. This is accomplished with specialized "wakeword models" that detect keywords spoken by users and initiate interactions with the device [2,3,4].…”
Section: Introduction (mentioning)
confidence: 99%
“…A head model has shown the benefits of embedding, which is built on learning many short utterances [23]. The head model quickly converges to the model from the shared weights of pretrained embeddings.…”
Section: B. KWS (mentioning)
confidence: 99%
“…Consequently, neural network optimization becomes difficult given the scarce training data available. To overcome data scarcity for KWS, a pretrained head model and synthesized data have been used [23]. Lin et al [23] state that building a state-of-the-art (SOTA) KWS model requires more than 4000 recorded human speech samples per command.…”
Section: Introduction (mentioning)
confidence: 99%
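To make the "synthesized data" side of this statement concrete, the snippet below sketches one way a small set of synthetic keyword utterances could be generated with an off-the-shelf TTS package. gTTS, the keyword list, the accent variants, and the output layout are illustrative assumptions only and are not the pipeline used in the cited papers.

# Minimal sketch: generate synthetic keyword clips with gTTS (illustrative only).
from pathlib import Path
from gtts import gTTS

keywords = ["stop", "go", "left", "right"]      # hypothetical keyword set
accents = ["com", "co.uk", "com.au", "co.in"]   # vary the voice via gTTS TLDs

out_dir = Path("synthetic_keywords")
out_dir.mkdir(exist_ok=True)

for kw in keywords:
    for tld in accents:
        tts = gTTS(text=kw, lang="en", tld=tld)
        fname = out_dir / f"{kw}_{tld.replace('.', '')}.mp3"
        tts.save(str(fname))

# The resulting clips would then be converted to the model's input features
# (e.g. log-mel spectrograms) and mixed with negative examples before
# training the keyword head.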
“…The presence of the trigger phrase at the beginning of an utterance helps distinguish audio that is directed towards the assistant from background speech. The problem of accurately detecting a trigger phrase is known as voice trigger detection [1,2], wake-word detection [3,4], or keyword spotting [5,6,7,8,9]. This work is motivated by the observation that audio following the trigger phrase can contain a strong signal about whether an utterance was directed towards the assistant or not.…”
Section: Introduction (mentioning)
confidence: 99%