LSTM-Based Whisper Detection

Raeesy, Zeynab; Gillespie, Kellen; Ma, Chengyuan; Drugman, Thomas; Gu, Jiacheng; Maas, Roland; Rastrow, Ariya; Hoffmeister, Björn

doi:10.1109/slt.2018.8639614

Cited by 15 publications

(10 citation statements)

References 12 publications

(10 reference statements)

“…To train the whispered speech utterance detection model, we transform the raw audio into 64 dimensional log mel-filterbank coefficients in every 25 ms window with 10 ms fixed frame rate, which follows [7]. We note that this paper does not use the engineering feature as described in Section 1 to ease the computation burden.…”

Section: Methods (mentioning)

confidence: 99%
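
The front end described in the statement above (64 log mel-filterbank coefficients per 25 ms window, 10 ms hop) can be sketched as follows. This is a minimal NumPy illustration, not the cited authors' code: the 16 kHz sample rate, the Hann window, the power-of-two FFT size, the HTK-style mel formula, and the function names are all assumptions made for the example.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; the source does not name a variant).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters whose centers are evenly spaced on the mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (fft_freqs - lo) / (ctr - lo)      # rising edge
        down = (hi - fft_freqs) / (hi - ctr)    # falling edge
        fb[i] = np.maximum(0.0, np.minimum(up, down))
    return fb

def log_mel_features(signal, sr=16000, n_mels=64, win_ms=25.0, hop_ms=10.0):
    """Return an (n_frames, n_mels) array of log mel-filterbank energies."""
    signal = np.asarray(signal, dtype=float)
    win = int(round(sr * win_ms / 1000.0))   # 400 samples at 16 kHz
    hop = int(round(sr * hop_ms / 1000.0))   # 160 samples at 16 kHz
    n_fft = 1 << (win - 1).bit_length()      # next power of two >= win (assumed)
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel + 1e-10)               # small floor avoids log(0)
```

Under these assumptions, one second of 16 kHz audio yields 98 frames of 64 coefficients each.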

“…The experiment examined three model architectures: MLP, LSTM and CNN. In [7], MLP and LSTM models were trained frame-by-frame and so, the output is the probability of each frame being whispered speech or not. Therefore, when judging whether an utterance as either whispered speech utterance or not, an inference module is needed.…”

Section: Methods (mentioning)

confidence: 99%
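
The statement above says that frame-level models need an inference module to reach an utterance-level decision, but it does not say which rule is used. One simple possibility, shown purely as an assumption, is to average the per-frame whisper probabilities and compare the result with a threshold. The function name and the 0.5 threshold are hypothetical.

```python
import numpy as np

def utterance_is_whisper(frame_probs, threshold=0.5):
    """Hypothetical inference module: mean frame posterior, then threshold.

    frame_probs: per-frame probabilities that the frame is whispered speech.
    Returns (decision, score).
    """
    frame_probs = np.asarray(frame_probs, dtype=float)
    if frame_probs.size == 0:
        raise ValueError("need at least one frame")
    score = float(frame_probs.mean())
    return score >= threshold, score

decision, score = utterance_is_whisper([0.9, 0.8, 0.7])
print(decision, round(score, 2))
```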

“…Their techniques are based on the Gaussian Mixture Model (GMM) trained by engineered features, specifically spectral information entropy (SIE) from divided sub-frames and sub-bands and the SIE ratio between the high band and the low band. More recently, [7] introduced deep neural networks (DNNs) to carry out the frame-level classification of whispered/non-whispered speech. This work evaluated long short-term memory (LSTM) in a comparison with the simple multilayer perceptron (MLP) as a baseline; the former achieved a higher frame accuracy.…”

Section: Introduction (mentioning)

confidence: 99%

“…In this article, we apply one of the oversampling methods called class-aware sampling [9] as it is well suits the training of neural networks. We also compare CNNs that can directly optimize the utterance-level target with frame-level DNNs [7].…”

Section: Introduction (mentioning)

confidence: 99%

Neural Whispered Speech Detection with Imbalanced Learning

Ashihara, Shinohara, Satō et al. 2019

Interspeech 2019

In this paper, we present a neural whispered-speech detection technique that offers utterance-level classification of whispered and non-whispered speech exhibiting imbalanced data distributions. Previous studies have shown that machine learning models trained on a large amount of whispered and non-whispered utterances perform remarkably well for whispered speech detection. However, it is often difficult to collect large numbers of whispered utterances. In this paper, we propose a method to train neural whispered speech detectors from a small amount of whispered utterances in combination with a large amount of non-whispered utterances. In doing so, special care is taken to ensure that severely imbalanced datasets can effectively train neural networks. Specifically, we use a class-aware sampling method for training neural networks. To evaluate the networks, we gather test samples recorded by both condenser and smartphone microphones at different distances from the speakers to simulate practical environments. Experiments show the importance of imbalanced learning in enhancing the performance of utterance level classifiers.
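
The abstract names class-aware sampling as the remedy for imbalance but does not spell out the procedure. As commonly described, each training example is drawn by first picking a class uniformly at random and then picking an utterance from that class, so rare classes appear in batches about as often as frequent ones. The sketch below assumes that formulation; the function name, batch size, and the 990-to-10 class split are illustrative and not taken from the paper.

```python
import random
from collections import defaultdict

def class_aware_batches(labels, batch_size, n_batches, seed=0):
    """Yield batches of dataset indices with classes drawn uniformly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = sorted(by_class)
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            c = rng.choice(classes)                 # step 1: uniform over classes
            batch.append(rng.choice(by_class[c]))   # step 2: uniform within class
        yield batch

# Illustrative imbalance: 990 non-whispered (0) vs 10 whispered (1) utterances.
labels = [0] * 990 + [1] * 10
batches = list(class_aware_batches(labels, batch_size=32, n_batches=50))
whisper_share = sum(labels[i] for b in batches for i in b) / (32 * 50)
print(f"whispered share in sampled batches: {whisper_share:.2f}")
```

Because the minority class is sampled with replacement, the same few whispered utterances recur often, which is the usual trade-off of oversampling schemes.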

“…As long short-term memory (LSTM) [10] leads to state-of-the-art results in various speech related tasks, e.g. automatic speech recognition [4], keyword spotting [11], speaker identification [12], whisper detection [13], it is employed as a popular solution for AEC as well [14,15,16,17,18,19,20], typically combined with convolutional neural networks (CNNs) [21]. To run applications mentioned above on mobile devices or smart speakers, a model with small memory footprint is required.…”

Section: Introduction (mentioning)

confidence: 99%

A Comparison of Pooling Methods on LSTM Models for Rare Acoustic Event Classification

Kao, Sun, Wang et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Acoustic event classification (AEC) and acoustic event detection (AED) refer to the task of detecting whether specific target events occur in audios. As long short-term memory (LSTM) leads to state-of-the-art results in various speech related tasks, it is employed as a popular solution for AEC as well. This paper focuses on investigating the dynamics of LSTM model on AEC tasks. It includes a detailed analysis on LSTM memory retaining, and a benchmarking of nine different pooling methods on LSTM models using 1.7M generated mixture clips of multiple events with different signal-to-noise ratios. This paper focuses on understanding: 1) utterance-level classification accuracy; 2) sensitivity to event position within an utterance. The analysis is done on the dataset for the detection of rare sound events from DCASE 2017 Challenge. We find max pooling on the prediction level to perform the best among the nine pooling approaches in terms of classification accuracy and insensitivity to event position within an utterance. To authors' best knowledge, this is the first kind of such work focused on LSTM dynamics for AEC tasks.
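
Max pooling on the prediction level, as named in the abstract, can be read as taking the per-class maximum of the frame-level posteriors over time. The sketch below assumes that reading and contrasts it with mean pooling and last-frame pooling on synthetic posteriors; the function name, the method labels, and the numbers are illustrative and do not reproduce the paper's nine methods or its results.

```python
import numpy as np

def pool_predictions(frame_probs, method="max"):
    """Collapse (n_frames, n_classes) frame posteriors to one score per class."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    if method == "max":
        return frame_probs.max(axis=0)
    if method == "mean":
        return frame_probs.mean(axis=0)
    if method == "last":
        return frame_probs[-1]
    raise ValueError(f"unknown pooling method: {method}")

# Synthetic single-class posteriors: a 5-frame event inside 100 frames,
# placed either near the start or near the end of the clip.
early = np.full((100, 1), 0.05)
early[5:10] = 0.9
late = np.full((100, 1), 0.05)
late[90:95] = 0.9

for name in ("max", "mean", "last"):
    print(name, pool_predictions(early, name), pool_predictions(late, name))
```

In this toy case the max-pooled score is the same wherever the event sits, while the mean-pooled score is diluted by the background frames of a short event.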

Quartered Spectral Envelope and 1D-CNN-Based Classification of Normally Phonated and Whispered Speech

Joysingh, Vijayalakshmi, Nagarajan 2022

Circuits Syst Signal Process

Human-computer interaction via speech is more common than ever before. Whisper, as a form of speech, is not sufficiently addressed by mainstream speech applications, such as automatic speech recognition, speaker identification, language identification, etc, even though there are more than a hundred thousand laryngectomees in the world who can only whisper. This is due to the fact that systems built for normal speech do not work as expected for whispered speech. A first step to building a speech application that is inclusive of whispered speech, is the successful classification of whispered speech and normal speech. Such a front-end classification system is expected to have high accuracy and low computational overhead, which is the scope of this paper. One of the characteristics of whispered speech is the absence of the fundamental frequency (or pitch), and hence the pitch harmonics as well. The presence of the pitch and pitch harmonics in normal speech, and its absence in whispered speech, is evident in the spectral envelope of the Fourier transform. We observe that this characteristic is predominant in the first quarter of the spectrum, and exploit the same as a feature. We propose the use of one dimensional convolutional neural networks (1D-CNN) to capture these features from the quartered spectral envelope (QSE). The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset. The proposed feature is compared with Mel frequency cepstral coefficients (MFCC), a staple in the speech domain. The proposed classification system is also compared with the state-of-the-art system based on log-filterbank energy (LFBE) features trained on long short-term memory (LSTM) network. The proposed system based on 1D-CNN performs better than, or as good as, the state-of-the-art across multiple experiments. It also converges sooner, with lesser computational overhead. 
Finally, the proposed system is evaluated under the presence of white noise at various signal-to-noise ratios and found to be robust.
