2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP)
DOI: 10.1109/globalsip.2014.7032183

Discriminatively trained recurrent neural networks for single-channel speech separation

Abstract: This paper describes an in-depth investigation of training criteria, network architectures and feature representations for regression-based single-channel speech separation with deep neural networks (DNNs). We use a generic discriminative training criterion corresponding to optimal source reconstruction from time-frequency masks, and introduce its application to speech separation in a reduced feature space (Mel domain). A comparative evaluation of time-frequency mask estimation by DNNs, recurrent DNNs and non-…
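As context for the abstract, a minimal NumPy sketch of mask estimation in a reduced (Mel-like) feature space: the mask is predicted on B bands rather than F linear-frequency bins, then expanded back for source reconstruction. All sizes, the uniform stand-in filterbank, and the random "network output" are hypothetical, not the paper's actual configuration.

    import numpy as np

    rng = np.random.default_rng(0)
    F, B, T = 257, 40, 100                     # linear bins, reduced bands, frames

    # Hypothetical band-pooling matrix M (B x F); a real system would use a
    # Mel-spaced triangular filterbank here.
    M = np.zeros((B, F))
    edges = np.linspace(0, F, B + 1).astype(int)
    for b in range(B):
        M[b, edges[b]:edges[b + 1]] = 1.0

    mix_mag = rng.random((F, T)) + 1e-3        # |Y|: mixture magnitude spectrogram
    mel_mix = M @ mix_mag                      # reduced-domain features (B x T)

    # Stand-in for the network's output: a mask in (0, 1) on the reduced bands.
    mel_mask = 1.0 / (1.0 + np.exp(-rng.standard_normal((B, T))))

    # Expand the band mask back to linear frequency and reconstruct the source.
    lin_mask = (M.T @ mel_mask) / np.maximum(M.T @ np.ones((B, T)), 1e-8)
    src_est = lin_mask * mix_mag               # estimated source magnitude

    print(mel_mix.shape, lin_mask.shape, src_est.shape)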

Cited by 249 publications (269 citation statements)
References 20 publications

“…There, only the gradient ∂E_SA/∂m̂ of the objective function with respect to the network output is specific to source separation, whereas the rest of the algorithm is unchanged. Using L instead of conventional sigmoid or half-wave activation functions helps reduce the vanishing temporal gradient problem of RNNs [5], allowing them to outperform DNNs with static context windows in speech enhancement [16].…”
Section: Speech Enhancement Methods
confidence: 99%
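The gradient the excerpt refers to can be written out explicitly. A plausible reconstruction under common signal-approximation notation (assumed here, not quoted from the paper: m̂ is the estimated mask, |y| the mixture magnitude, |s| the target-source magnitude):

    E_{\mathrm{SA}} = \sum_{t,f} \bigl( \hat{m}_{t,f}\,|y_{t,f}| - |s_{t,f}| \bigr)^2,
    \qquad
    \frac{\partial E_{\mathrm{SA}}}{\partial \hat{m}_{t,f}} = 2 \bigl( \hat{m}_{t,f}\,|y_{t,f}| - |s_{t,f}| \bigr)\,|y_{t,f}|.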
“…Here, we consider deep recurrent neural networks (DRNNs), as proposed in [16]. The mask m̂_t is estimated by the DRNN forward pass, which is defined as follows, for hidden layers k = 1, …”
Section: Speech Enhancement Methods
confidence: 99%
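To make the truncated forward-pass definition concrete, a minimal sketch of a deep recurrent mask estimator: each hidden layer k = 1..K combines the layer below at frame t with its own state at frame t-1, and the output layer emits a mask in (0, 1). A plain tanh recurrence stands in for the LSTM function L of the excerpt, and all sizes and weights are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    T, F, H, K = 50, 64, 128, 2    # frames, frequency bins, hidden size, layers
    Wx = [rng.standard_normal((H, F if k == 0 else H)) * 0.1 for k in range(K)]
    Wh = [rng.standard_normal((H, H)) * 0.1 for _ in range(K)]
    Wo = rng.standard_normal((F, H)) * 0.1

    def drnn_mask(features):
        """Run the DRNN over a (T, F) feature sequence; return a (T, F) mask."""
        h_prev = [np.zeros(H) for _ in range(K)]
        masks = []
        for t in range(T):
            x = features[t]
            for k in range(K):
                x = np.tanh(Wx[k] @ x + Wh[k] @ h_prev[k])  # input + recurrence
                h_prev[k] = x
            masks.append(sigmoid(Wo @ x))  # constrain the mask to (0, 1)
        return np.stack(masks)

    mask_hat = drnn_mask(rng.random((T, F)))
    print(mask_hat.shape)  # (50, 64)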
“…Previous studies have focused on either developing improved feature extraction methods or using more sophisticated classifiers, for example moving from Gaussian mixture models (GMMs) to deep neural networks (DNNs). Some attention has been focused on improving the classifier to reduce perceptual error by changing the loss function for text-to-speech applications [14], and introducing signal approximation loss functions [15,16] as a replacement for mask approximation within speech separation applications. Signal approximation loss functions apply the output of the network to the noisy spectrum within the loss function, and minimise this with respect to the target.…”
Section: Introduction
confidence: 99%
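For contrast, a short sketch of the two loss families the excerpt distinguishes, under assumed notation: mask approximation (MA) matches the estimated mask to an ideal mask target, while signal approximation (SA) applies the estimated mask to the noisy magnitude inside the loss and matches the result to the clean target.

    import numpy as np

    rng = np.random.default_rng(0)
    T, F = 100, 64
    mix_mag = rng.random((T, F)) + 1e-3        # |Y|: noisy-mixture magnitude
    src_mag = rng.random((T, F)) * mix_mag     # |S|: clean target (kept <= |Y|)
    ideal_mask = src_mag / mix_mag             # e.g. an ideal ratio mask target
    est_mask = np.full((T, F), 0.5)            # stand-in for the network output

    ma_loss = np.mean((est_mask - ideal_mask) ** 2)         # mask approximation
    sa_loss = np.mean((est_mask * mix_mag - src_mag) ** 2)  # signal approximation
    print(ma_loss, sa_loss)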