ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019
DOI: 10.1109/icassp.2019.8683565

Exploring Attention Mechanism for Acoustic-based Classification of Speech Utterances into System-directed and Non-system-directed

Abstract: Voice controlled virtual assistants (VAs) are now available in smartphones, cars, and standalone devices in homes. In most cases, the user needs to first "wake-up" the VA by saying a particular word/phrase every time he or she wants the VA to do something. Eliminating the need for saying the wake-up word for every interaction could improve the user experience. This would require the VA to have the capability to detect the speech that is being directed at it and respond accordingly. In other words, the challeng…

Cited by 22 publications (37 citation statements)
References 14 publications (18 reference statements)
“…This paper continues our previous work in [1] where only acoustic features were used for utterance classification. Specifically, we present two methods that incorporate non-acoustic information into our models to improve upon our previous acoustic-only-based performance; the first incorporates ASR decoder features in addition to the usual acoustic features, while the second further adds word embeddings as inputs to the final classification stage of the model.…”
Section: Introduction (supporting)
confidence: 57%
“…Approaches for the classification of utterances into system- and non-system-directed ones typically use acoustic features extracted from the speech signal, e.g., [1,2,3,4,5]. Previous works [1,6,7] also show that using an attention mechanism combined with a BiLSTM network can improve classification performance.…”
Footnote in the citing paper: "*Author performed research herein as part of an internship-partnership program between Mila and Nuance."
Section: Introduction (mentioning)
confidence: 99%
“…The CNNs or the BiLSTMs layers generate a sequence of vectors for the classification process [32]. The attention layer is used to convert the sequence of vectors (frames) into a context vector, which attends some parts of the input sequence [33, 34]. Figure 2 illustrates the role of the attention layer in our approach.…”
Section: The Proposed Framework (mentioning)
confidence: 99%
“…The sequence of vectors (frames) produced from CNN or LSTM and forwarded to the attention layer to convert them into a context vector [23, 28, 29]. The attention weight are forwarded to Softmax function at time t to generate the probability of the frame out of one to the remaining frames in the same speech segment.…”
Section: The Proposed Framework (mentioning)
confidence: 99%
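
The attention step described in the two statements above (a sequence of frame vectors reduced to a single context vector through softmax-normalized weights) can be illustrated with a minimal sketch. This is not the implementation from any of the cited papers: the additive tanh scoring function and the parameters `w`, `b`, and `v` are assumptions chosen for illustration, since the quoted text does not say how the per-frame scores are computed.

```python
import math


def attention_pool(frames, w, b, v):
    """Collapse a sequence of frame vectors into one context vector.

    frames: list of T vectors of length D (e.g. CNN or BiLSTM outputs).
    w (D x A matrix), b (length A), v (length A) are hypothetical learned
    parameters of an additive scoring function; they are an assumption,
    not something specified in the quoted statements.
    """
    scores = []
    for h in frames:
        # Project the frame and squash it: tanh(h . w + b)
        hidden = [
            math.tanh(sum(h[d] * w[d][a] for d in range(len(h))) + b[a])
            for a in range(len(b))
        ]
        # One scalar relevance score per frame
        scores.append(sum(hidden[a] * v[a] for a in range(len(v))))

    # Softmax across the frames of the segment (max-shifted for stability)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the frame vectors
    dim = len(frames[0])
    context = [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]
    return context, weights


# Toy usage: three 2-dimensional frames and made-up parameters
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[0.5, -0.5], [0.25, 0.75]]
b = [0.0, 0.0]
v = [1.0, 1.0]
context, weights = attention_pool(frames, w, b, v)
```

The weights are positive and sum to one over the frames of a segment, so the context vector is a weighted average of the frames. If every frame receives the same score, the result reduces to plain mean pooling.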