Deep neural network based spectral feature mapping for robust speech recognition

Han, Ke; He, Yanzhang; Bagchi, Deblin; Fosler‐Lussier, Eric; Wang, DeLiang

doi:10.21437/interspeech.2015-536

Cited by 34 publications

(17 citation statements)

References 19 publications


“…With this spectral feature mapping (SFM) approach, we can pass the output of our enhancement model directly to the ASR model (Figure 1). While deep learning has previously been applied to SFM for ASR [17,18,19], our work is the first to use GANs for this task. Michelsanti et al [20] employ GANs for SFM, but target speaker verification rather than ASR.…”

Section: Introduction (mentioning)

confidence: 99%

Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition

Donahue

Prabhavalkar

2018

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)



We investigate the effectiveness of generative adversarial networks (GANs) for speech enhancement, in the context of improving noise robustness of automatic speech recognition (ASR) systems. Prior work [1] demonstrates that GANs can effectively suppress additive noise in raw waveform speech signals, improving perceptual quality metrics; however this technique was not justified in the context of ASR. In this work, we conduct a detailed study to measure the effectiveness of GANs in enhancing speech contaminated by both additive and reverberant noise. Motivated by recent advances in image processing [2], we propose operating GANs on log-Mel filterbank spectra instead of waveforms, which requires less computation and is more robust to reverberant noise. While GAN enhancement improves the performance of a clean-trained ASR system on noisy speech, it falls short of the performance achieved by conventional multi-style training (MTR). By appending the GAN-enhanced features to the noisy inputs and retraining, we achieve a 7% WER improvement relative to the MTR system.



“…Spectral mapping has been used to generate clean speech signals. However, in [8,7] they use only a local learning objective. Student-teacher networks have been used to improve the quality of noisy speech recognition [16,17,18].…”

Section: Prior Work (mentioning)

confidence: 99%

“…We train a DNN-based spectral mapper for feature denoising. In our previous work [7,8], we have shown that a DNN-based spectral mapper, which takes noisy spectrogram as input to predict clean filterbank features for ASR, yields good results on the CHiME-2 noisy and reverberant dataset. Specifically, we first divide the input time-domain signals into 25-ms frames with a 10-ms frame shift, and then apply short time Fourier transform (STFT) to compute log spectral magnitudes in each time frame.…”

Section: Spectral Mapping (mentioning)

confidence: 99%
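
The framing and log-spectral-magnitude step described in the statement above can be sketched as follows. This is a minimal illustration, not the authors' code: the 25-ms frame length and 10-ms shift come from the quoted text, while the 16 kHz sample rate, the Hamming window, the `eps` floor, and the function name `log_spectral_magnitudes` are assumptions introduced here.

```python
import numpy as np

def log_spectral_magnitudes(signal, sample_rate=16000, frame_ms=25.0,
                            shift_ms=10.0, eps=1e-10):
    """Split a 1-D signal into overlapping frames and return log |STFT|.

    sample_rate, the Hamming window, and eps are assumptions for this
    sketch; only the 25-ms / 10-ms framing is taken from the quoted text.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))    # 400 samples at 16 kHz
    frame_shift = int(round(sample_rate * shift_ms / 1000.0))  # 160 samples at 16 kHz
    if signal.shape[0] < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Number of complete frames; trailing samples that do not fill a frame are dropped.
    num_frames = 1 + (signal.shape[0] - frame_len) // frame_shift

    # Index matrix of shape (num_frames, frame_len) selecting each frame's samples.
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(num_frames)[:, None])
    frames = signal[idx] * np.hamming(frame_len)

    # One-sided FFT per frame, then log magnitude with a small floor to avoid log(0).
    spectrum = np.fft.rfft(frames, axis=1)
    return np.log(np.abs(spectrum) + eps)

# Usage: one second of synthetic noise at the assumed 16 kHz rate.
features = log_spectral_magnitudes(np.random.default_rng(0).standard_normal(16000))
```

Each row of `features` is one time frame and each column one frequency bin; in a spectral mapper of the kind described, rows like these would form the noisy input from which clean filterbank features are predicted.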


Spectral Feature Mapping with MIMIC Loss for Robust Speech Recognition

Bagchi

Plantinga

Stiff

et al. 2018

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


For the task of speech enhancement, local learning objectives are agnostic to phonetic structures helpful for speech recognition. We propose to add a global criterion to ensure de-noised speech is useful for downstream tasks like ASR. We first train a spectral classifier on clean speech to predict senone labels. Then, the spectral classifier is joined with our speech enhancer as a noisy speech recognizer. This model is taught to imitate the output of the spectral classifier alone on clean speech. This mimic loss is combined with the traditional local criterion to train the speech enhancer to produce de-noised speech. Feeding the de-noised speech to an off-the-shelf Kaldi training recipe for the CHiME-2 corpus shows significant improvements in WER.
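
The combination of a local criterion with a mimic loss described in this abstract might look roughly like the sketch below. The abstract does not specify the distance measures or the weighting, so the mean squared error for both terms, the `mimic_weight` parameter, and the linear stand-in classifier are all assumptions made for illustration; every function name here is hypothetical.

```python
import numpy as np

def make_linear_classifier(weights, bias):
    """Hypothetical stand-in for a frozen spectral classifier.

    Returns a function mapping feature frames to pre-softmax scores.
    """
    def classify(features):
        return features @ weights + bias
    return classify

def local_loss(enhanced, clean):
    # Local criterion: frame-wise distance between enhanced and clean features
    # (mean squared error is an assumption of this sketch).
    return float(np.mean((enhanced - clean) ** 2))

def mimic_loss(enhanced, clean, classifier):
    # Mimic criterion: distance between the classifier's outputs on enhanced
    # features and its outputs on the corresponding clean features.
    return float(np.mean((classifier(enhanced) - classifier(clean)) ** 2))

def combined_loss(enhanced, clean, classifier, mimic_weight=1.0):
    # Weighted sum of both criteria; mimic_weight is a hypothetical knob.
    return (local_loss(enhanced, clean)
            + mimic_weight * mimic_loss(enhanced, clean, classifier))

# Usage with synthetic data: 4 frames of 40-dimensional features, 10 classes.
rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 40))
enhanced = clean + 0.1 * rng.standard_normal((4, 40))
classifier = make_linear_classifier(rng.standard_normal((40, 10)), np.zeros(10))
loss = combined_loss(enhanced, clean, classifier)
```

In an actual training setup the classifier would stay fixed while gradients of the combined loss flow back into the enhancer; this sketch only evaluates the loss value.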


“…Traditional signal processing-based methods, such as the Wiener filtering and Spectral Subtraction, among many others, provide noise reduction based on signal processing algorithms. More recently, Deep Neural Networks (DNN) have been presented in [6,7,8,9]. The main approach for DNN is the mapping of spectral features from noisy speech into the features of the corresponding clean speech.…”

Section: Introduction (mentioning)

confidence: 99%

Supervised Initialization of LSTM Networks for Fundamental Frequency Detection in Noisy Speech Signals

Coto-Jiménez

2019

Preprint


Fundamental frequency is one of the most important parameters of human speech, relevant to the classification of accent, gender, speaking style, speaker identity, and age, among others. The proper detection of this parameter remains an important challenge for severely degraded signals. In previous work on detecting fundamental frequency in noisy speech using deep learning, networks such as Long Short-term Memory (LSTM) have been initialized with random weights and then trained with a back-propagation through time algorithm. In this work, a proposal for a more efficient initialization, based on supervised training using an auto-associative network, is presented. This initialization is a better starting point for the detection of fundamental frequency in noisy speech. The advantages of this initialization are noticeable using objective measures for the accuracy of the detection and for the training of the networks, in the presence of additive white noise at different signal-to-noise levels.

