Lattice-Based Improvements for Voice Triggering Using Graph Neural Networks

Dighe, Pranay; Adya, Saurabh; Li, Nuoyu; Vishnubhotla, Srikanth; Naik, Devang; Sagar, Adithya; Ma, Y.; Pulman, Stephen; Williams, Jason D.

doi:10.1109/icassp40776.2020.9053781

Cited by 5 publications

(9 citation statements)

References 13 publications

(21 reference statements)

“…Lattice embeddings are obtained by treating the lattice as a graph and processing it using multiple hidden layers of multi-headed self-attention operation. These embeddings have been shown to be highly informative for FTM task [12,1], but they can be obtained only by running full-fledged ASR on the audio which is expensive to be run on-device and invades user privacy in case of a false trigger. Moreover, the LatticeGNN model needs to be retrained if the distribution of the input lattice features changes due to any changes in the acoustic model, language model or the ASR decoding parameters.…”

Section: LatticeGNN FTM and Lattice Embeddings (mentioning)

confidence: 99%

“…Other prior approaches for device-directed utterance detection includes various trigger-phrase detection techniques explored in [7,8,9,10,11]. Lattice-based techniques which complement trigger-phrase detection systems have been explored in [12,6,1].…”

Section: Introduction (mentioning)

confidence: 99%

“…In these cases, the smart assistant relies on other mechanisms than trigger-phrase detection to determine if the user intends to speak to the device or not. One such mechanism uses self-attention based graph neural networks [1,3] (called "LatticeGNN" hereafter) to extract information from ASR lattices to detect whether the user audio contains spoken content directed towards the device or not. While LatticeGNN ensures that falsely triggered audio is accurately identified, there are limitations in its usage.…”

Section: Introduction (mentioning)

confidence: 99%

Knowledge Transfer for Efficient on-Device False Trigger Mitigation

Dighe, Marchi, Vishnubhotla, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

In this paper, we address the task of determining whether a given utterance is directed towards a voice-enabled smart-assistant device or not. An undirected utterance is termed as a "false trigger" and false trigger mitigation (FTM) is essential for designing a privacy-centric non-intrusive smart assistant. The directedness of an utterance can be identified by running automatic speech recognition (ASR) and determining the user intent by analyzing the ASR transcript. Yet, in case of a false trigger, transcribing the audio using ASR itself is strongly undesirable. To alleviate this issue, we propose an LSTM-based FTM architecture which determines the user intent from acoustic features directly without explicitly generating ASR transcripts from the audio. The proposed models are small-footprint and can be run on-device with limited computational resources. During training, the model parameters are optimized using a knowledge transfer approach where a more accurate self-attention graph neural network model [1] serves as the teacher. Given the whole audio snippets, our approach mitigates 87% of false triggers at 99% true positive rate (TPR), and in a streaming audio scenario, the system listens to only 1.69s of the false trigger audio before rejecting it while achieving the same TPR.

“…All our experiments are performed on an FTM dataset [8], which is composed of far field usage samples with manual labels of "true trigger" (TT) and "false trigger" (FT) classes. The raw audio data are split into train, cv, dev, and eval sets for the purposes of training, cross-validation, development and evaluation.…”

Section: FTM Dataset and Evaluation Metrics (mentioning)

confidence: 99%

“…Thus a classifier built on top of the Bi-LRNN is able to mitigate the false trigger cases significantly. A recent work [8] explored the use of graph neural networks (GNN) to encode the decoding lattice, which achieves similar accuracy as the Bi-LRNN representation with more efficient training.…”

Section: Introduction (mentioning)

confidence: 99%

Complementary Language Model and Parallel Bi-LRNN for False Trigger Mitigation

Agarwal, Niu, Dighe, et al. 2020

Interspeech 2020

Self Cite

False triggers in voice assistants are unintended invocations of the assistant, which not only degrade the user experience but may also compromise privacy. False trigger mitigation (FTM) is a process to detect the false trigger events and respond appropriately to the user. In this paper, we propose a novel solution to the FTM problem by introducing a parallel ASR decoding process with a special language model trained from "out-of-domain" data sources. Such language model is complementary to the existing language model optimized for the assistant task. A bidirectional lattice RNN (Bi-LRNN) classifier trained from the lattices generated by the complementary language model shows a 38.34% relative reduction of the false trigger (FT) rate at the fixed rate of 0.4% false suppression (FS) of correct invocations, compared to the current Bi-LRNN model. In addition, we propose to train a parallel Bi-LRNN model based on the decoding lattices from both language models, and examine various ways of implementation. The resulting model leads to further reduction in the false trigger rate by 10.8%.
