2015
DOI: 10.1007/978-3-319-22482-4_11
Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR

Abstract: We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used 'naïvely' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furthermore, simpl…

Cited by 495 publications (340 citation statements)
References 18 publications
“…For traditional speech enhancement techniques, which require either no training or training on the noise context preceding each test utterance (Cohen et al., 2010; Hurmalainen et al., 2013), the issue of mismatched noise conditions did not arise. This recently became a concern with the emergence of speech enhancement techniques based on deep neural networks (DNNs) (Xu et al., 2014; Weninger et al., 2015), which require a larger amount of training data not limited to the immediate context. Chen et al. (2015) and Kim and Smaragdis (2015) considered the problem of adapting DNN-based enhancement to unseen test conditions, but their experiments were conducted on small, simulated datasets and evaluated in terms of enhancement metrics.…”
Section: Introduction
confidence: 99%
“…In [40], the DNN is used to estimate the instantaneous SNR for computing the IRM, which is subsequently applied to filter out noise from a noisy Mel spectrogram. Recurrent neural networks (RNNs), with their ability to model the temporal dependencies in speech, have also been employed to estimate time-frequency masks from the magnitude spectrum of a noisy signal for speech enhancement and recognition [41]. For speech recognition applications, speech enhancement approaches are typically exploited as front-end processing to reconstruct the clean version of the speech, which is then fed into a speech recognizer.…”
Section: Speech Enhancement Using DNN
confidence: 99%
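The SNR-to-IRM mapping mentioned in the excerpt above can be sketched minimally. Assumptions: linear-scale SNR values, and function names and toy numbers that are illustrative only, not taken from [40].

```python
import numpy as np

def ideal_ratio_mask(snr):
    # IRM(t, f) = SNR(t, f) / (SNR(t, f) + 1), giving values in [0, 1):
    # high-SNR units pass nearly unchanged, low-SNR units are attenuated.
    return snr / (snr + 1.0)

def enhance(noisy_mel, snr):
    # Filter noise from a noisy Mel spectrogram by pointwise masking.
    return ideal_ratio_mask(snr) * noisy_mel

# Toy (time, frequency) grid of linear-scale instantaneous SNR estimates.
snr = np.array([[9.0, 1.0],
                [0.0, 3.0]])
mask = ideal_ratio_mask(snr)
# mask == [[0.9, 0.5], [0.0, 0.75]]
```

The mask is applied pointwise, so a unit with SNR 9 keeps 90% of its energy while a unit with SNR 0 is zeroed out entirely.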
“…speech recognition [2] and speech synthesis [3]. More recently, the DNN has been applied to speech separation [4]–[7] and enhancement/denoising [8]–[10], particularly for monaural recordings [4]–[6], [8]–[10]. When processing mixtures of target speech signals and competing noise, speech separation may be considered as speech enhancement.…”
Section: Introduction
confidence: 99%
“…In order to recover the underlying target speech embedded in noise, most of the deep neural networks, either recurrent [4], [5], [10] or feedforward [4], [6], [8], [9], [11], are trained to optimize some objective functions such as the mean squared error (MSE) between the true and predicted outputs. The inputs to the DNN are often (hybrid) features such as time-frequency (TF) domain spectral features [4]–[6], [8]–[10] and filterbank features [4], [5], [11]; while the output can be the TF unit level features that can be used to recover the speech source, such as ideal binary/ratio masks (IBM/IRM) [4]–[6], [11], direct magnitude spectra [9], [10] or their transforms such as log power (LP) spectra [8].…”
Section: Introduction
confidence: 99%
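The training setup described in the excerpt above — a network mapping noisy TF features to a mask by minimizing the MSE between true and predicted outputs — can be sketched as a toy. Assumptions: the single linear layer with a sigmoid output, the random stand-in data, and the learning rate are illustrative choices, not the architectures of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mask-estimation training: a linear layer plus sigmoid maps noisy
# spectral features to a time-frequency (TF) mask, trained by full-batch
# gradient descent on the MSE between predicted and ideal ratio masks.
T, F = 50, 16                         # time frames, frequency bins
X = rng.normal(size=(T, F))           # noisy spectral features (stand-in)
M = rng.uniform(size=(T, F))          # ideal ratio mask targets in [0, 1]

def mse(pred, target):
    return np.mean((pred - target) ** 2)

def forward(W):
    # Sigmoid keeps every predicted mask value in (0, 1).
    return 1.0 / (1.0 + np.exp(-(X @ W)))

W = np.zeros((F, F))
initial_loss = mse(forward(W), M)     # all predictions start at 0.5
for _ in range(200):
    pred = forward(W)
    # Gradient of the MSE through the sigmoid, averaged over all TF units.
    grad = X.T @ ((pred - M) * pred * (1.0 - pred)) * (2.0 / (T * F))
    W -= 5.0 * grad

final_loss = mse(forward(W), M)       # lower than initial_loss
```

The same loss and mask parameterization carry over unchanged when the linear layer is replaced by a deep feedforward or LSTM network, which is the setting the cited works actually study.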