2018 IEEE Spoken Language Technology Workshop (SLT) 2018
DOI: 10.1109/slt.2018.8639614

LSTM-Based Whisper Detection

Abstract: This article presents a whisper speech detector in the far-field domain. The proposed system consists of a long short-term memory (LSTM) neural network trained on log-filterbank energy (LFBE) acoustic features. This model is trained and evaluated on recordings of human interactions with voice-controlled, far-field devices in whisper and normal phonation modes. We compare multiple inference approaches for utterance-level classification by examining trajectories of the LSTM posteriors. In addition, we engineer a …

Citations: cited by 15 publications (10 citation statements)
References: 12 publications (10 reference statements)
“…To train the whispered speech utterance detection model, we transform the raw audio into 64 dimensional log mel-filterbank coefficients in every 25 ms window with 10 ms fixed frame rate, which follows [7]. We note that this paper does not use the engineering feature as described in Section 1 to ease the computation burden.…”
Section: Methods (mentioning)
confidence: 99%
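
The framing described in the statement above (25 ms windows at a fixed 10 ms frame rate, 64 coefficients per frame) fixes the shape of the feature matrix for a given recording. The following is a minimal sketch of that arithmetic only, not the cited authors' pipeline. The 16 kHz sample rate, the function name `frame_layout`, and the choice to drop a trailing partial window are assumptions made for illustration; the source does not state them.

```python
def frame_layout(num_samples, sample_rate=16000, window_ms=25, shift_ms=10, num_bands=64):
    """Return (window_len, hop_len, num_frames, feature_shape) for fixed-rate framing.

    sample_rate=16000 is an assumed value; the window, shift, and band count
    follow the quoted statement (25 ms, 10 ms, 64 coefficients).
    """
    window_len = sample_rate * window_ms // 1000  # samples per analysis window
    hop_len = sample_rate * shift_ms // 1000      # samples between frame starts
    if num_samples < window_len:
        num_frames = 0                            # too short for one full window
    else:
        # Assumption: no padding, so a trailing partial window is dropped.
        num_frames = 1 + (num_samples - window_len) // hop_len
    return window_len, hop_len, num_frames, (num_frames, num_bands)


# One second of audio at the assumed 16 kHz sample rate.
layout = frame_layout(num_samples=16000)
print(layout)
```

Under these assumptions, each frame would contribute one 64-dimensional vector, so the model input for an utterance is a matrix with one row per frame.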
“…The experiment examined three model architectures: MLP, LSTM and CNN. In [7], MLP and LSTM models were trained frame-by-frame and so, the output is the probability of each frame being whispered speech or not. Therefore, when judging whether an utterance as either whispered speech utterance or not, an inference module is needed.…”
Section: Methods (mentioning)
confidence: 99%
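
The statement above notes that frame-by-frame models emit one whisper probability per frame, so a separate inference step is needed to label the whole utterance. The sketch below shows two simple ways such a step could work: averaging the frame posteriors, or taking the final frame's posterior. Both rules, the 0.5 threshold, and the name `classify_utterance` are hypothetical illustrations; the source does not specify which inference approaches the paper compares.

```python
def classify_utterance(frame_posteriors, method="mean", threshold=0.5):
    """Collapse per-frame whisper posteriors into one utterance-level decision.

    Returns (score, is_whisper). The "mean" and "last" rules and the default
    threshold are assumptions for illustration, not the paper's methods.
    """
    if not frame_posteriors:
        raise ValueError("need at least one frame posterior")
    if method == "mean":
        # Average the posterior trajectory over all frames.
        score = sum(frame_posteriors) / len(frame_posteriors)
    elif method == "last":
        # Use only the final frame, after the recurrent state has seen the whole utterance.
        score = frame_posteriors[-1]
    else:
        raise ValueError(f"unknown method: {method}")
    return score, score >= threshold


# Made-up posterior trajectory for a short utterance.
posteriors = [0.2, 0.4, 0.7, 0.9, 0.8]
mean_score, mean_is_whisper = classify_utterance(posteriors, "mean")
last_score, last_is_whisper = classify_utterance(posteriors, "last")
print(mean_score, mean_is_whisper, last_score, last_is_whisper)
```

The two rules can disagree on the same trajectory, which is one reason an inference module is a design choice to evaluate rather than a fixed step.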
“…As long short-term memory (LSTM) [10] leads to state-of-the-art results in various speech related tasks, e.g. automatic speech recognition [4], keyword spotting [11], speaker identification [12], whisper detection [13], it is employed as a popular solution for AEC as well [14,15,16,17,18,19,20], typically combined with convolutional neural networks (CNNs) [21]. To run applications mentioned above on mobile devices or smart speakers, a model with small memory footprint is required.…”
Section: Introduction (mentioning)
confidence: 99%