We describe the design of a voice trigger detection system for smart speakers. We address two major challenges. The first is that the detectors are deployed in complex acoustic environments, with external noise and loud playback by the device itself. The second is that collecting training examples for a specific keyword or trigger phrase is difficult, resulting in a scarcity of trigger-phrase-specific training data. We describe a two-stage cascaded architecture in which a low-power detector is always running and listening for the trigger phrase. If a detection is made at this stage, the candidate audio segment is re-scored by larger, more complex models to verify that the segment contains the trigger phrase. In this study, we focus on the architecture and design of these second-pass detectors. We start by training a general acoustic model that produces phonetic transcriptions, given a large labelled training dataset. Next, we collect a much smaller dataset of examples that are challenging for the baseline system. We then use multi-task learning to train a model to simultaneously produce accurate phonetic transcriptions on the larger dataset and discriminate between true and easily confusable examples using the smaller dataset. Our results demonstrate that the proposed model halves the error rate of the baseline across a range of challenging test conditions, without requiring extra parameters.
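The two-stage cascade described above can be sketched as follows. This is a minimal illustration of the control flow only; the function names, scorers, and thresholds are hypothetical stand-ins, not the system's actual components.

```python
# Sketch of a two-stage cascaded trigger detector: a cheap always-on
# first pass gates a larger, more accurate second-pass re-scorer.
# All names and threshold values here are illustrative assumptions.

def detect_trigger(audio_segment,
                   first_pass_score,
                   second_pass_score,
                   first_threshold=0.3,
                   second_threshold=0.8):
    """Return True only if both stages accept the candidate segment."""
    # Stage 1: low-power detector with a permissive threshold
    # (favours recall; its false accepts are filtered by stage 2).
    if first_pass_score(audio_segment) < first_threshold:
        return False
    # Stage 2: the larger model re-scores only surviving candidates,
    # so it runs rarely and its cost is amortised.
    return second_pass_score(audio_segment) >= second_threshold
```

The key design property is that the expensive second-pass model is invoked only on the small fraction of audio that passes the first stage.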
We describe the architecture of an always-on keyword spotting (KWS) system used to initiate interactions on battery-powered mobile devices. An always-available voice assistant needs a carefully designed keyword detector to satisfy the power and computational constraints of such devices. We employ a multi-stage system in which a low-power primary stage decides when to run a more accurate (but more power-hungry) secondary detector. We describe a straightforward primary detector and explore variations that yield substantial reductions in computation (or increased accuracy at the same computation). By reducing the set of target labels from three to one per phone, and reducing the rate at which the acoustic model is run, the compute rate can be reduced by a factor of six while maintaining the same accuracy.
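The factor-of-six figure follows from multiplying the two independent reductions: 3× from shrinking the label set (three labels per phone down to one) and 2× from halving the rate at which the acoustic model is evaluated. A back-of-the-envelope check, assuming a hypothetical 100-frames-per-second baseline and a cost model that is linear in both factors:

```python
# Illustrative compute model for the reductions described above.
# The 3-labels-per-phone baseline follows the text; the 100 fps
# baseline frame rate and the linear cost model are assumptions.

def relative_compute(labels_per_phone, frames_per_second,
                     baseline_labels=3, baseline_fps=100):
    """Compute rate relative to the baseline detector (1.0 = baseline)."""
    return (labels_per_phone / baseline_labels) * (frames_per_second / baseline_fps)

# One label per phone, acoustic model run at half the frame rate:
reduction = 1 / relative_compute(labels_per_phone=1, frames_per_second=50)
# reduction == 6.0, i.e. the factor-of-six saving quoted above
```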
Conversational speech recognition is a challenging problem, primarily because speakers rarely fully articulate sounds. A successful speech recognition approach must infer intended spectral targets from the speech data, or develop a method of dealing with large variances in the data. Hidden Dynamic Models (HDMs) attempt to automatically learn such targets in a hidden feature space, using models that integrate linguistic information with constrained temporal trajectory models. HDMs are a radical departure from conventional hidden Markov models (HMMs), which simply account for variation in the observed data. In this paper, we present an initial evaluation of such models on a conversational speech recognition task involving a subset of the SWITCHBOARD corpus. We show that in an N-best rescoring paradigm, HDMs are capable of delivering performance competitive with HMMs.

1. INTRODUCTION

Hidden dynamic models [1,2] (HDMs) attempt to produce acoustic likelihoods of phone-level sound units that reflect intended spectral configurations, rather than likelihoods based on the actual realization of the sound in the speech data. This is a radical departure from current statistical modeling approaches, which attempt to account for variation in the data by accumulating large numbers of Gaussian mixture components. It is conjectured that this approach will produce more consistent acoustic scoring for conversational speech, because sounds are rarely fully articulated in such data. Tremendous amounts of variation are observed in the speech data because the manner in which the realization of a sound is truncated is highly context-dependent. The goal of this work is to produce acoustic scores that reflect measurements in the hidden (or target) space, rather than directly in the feature space as is currently done in context-dependent phonetic modeling. The work presented here was the culmination of an intense effort at the 1998 NSF Workshop on Language Engineering held at the Center for Language and Speech Processing at Johns Hopkins University. One goal of this work, and the primary focus of this paper, was to evaluate the HDM approach on a credible conversational speech recognition task involving the SWITCHBOARD (SWB) corpus [3].

2. HIDDEN DYNAMIC MODELS
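The core intuition behind hidden dynamics can be illustrated with a toy trajectory model: the hidden state is pulled smoothly toward each phone's spectral target rather than jumping to it, so short phones never fully reach their targets, mimicking truncated articulation. The first-order dynamics, scalar state, and rate constant below are illustrative assumptions, not the paper's actual model.

```python
# Toy illustration of a constrained temporal trajectory toward
# per-phone targets. Short durations leave the trajectory short of
# the target, i.e. the realization is "truncated" in a way that
# depends on the surrounding context.

def hidden_trajectory(targets, durations, rate=0.4, z0=0.0):
    """Evolve a scalar hidden state z toward each phone's target."""
    z, path = z0, []
    for target, dur in zip(targets, durations):
        for _ in range(dur):
            z += rate * (target - z)   # smooth pull toward the target
            path.append(z)
    return path

# Two phones with targets +1 and -1, each only 3 frames long:
# neither target is reached before the next phone begins.
path = hidden_trajectory(targets=[1.0, -1.0], durations=[3, 3])
```

An HDM scores a sound against its intended target in this hidden space, rather than against the undershooting surface realization directly.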