An analysis of environment, microphone and data simulation mismatches in robust speech recognition

Vincent, Emmanuel; Watanabe, Shinji; Nugraha, Aditya Arie; Barker, Jon; Marxer, Ricard

doi:10.1016/j.csl.2016.11.005

Cited by 292 publications

(185 citation statements)

References 57 publications

(88 reference statements)

Supporting

Mentioning

184

Contrasting


“…The proposed multi-span AM was evaluated by training systems on CHiME4 [28] and AMI [29] using HTK 3.5.1 and PyHTK [30,31]. In the results reported here, the multi-span feature vector p of the concatenated input streams is fed into a simple feed forward DNN with 4 hidden layers each having 512 output nodes and ReLU activation function.…”

Section: Methods (mentioning)

confidence: 99%

Multi-Span Acoustic Modelling Using Raw Waveform Signals

2019


Traditional automatic speech recognition (ASR) systems often use an acoustic model (AM) built on handcrafted acoustic features, such as log Mel-filter bank (FBANK) values. Recent studies found that AMs with convolutional neural networks (CNNs) can directly use the raw waveform signal as input. Given sufficient training data, these AMs can yield a word error rate (WER) competitive with that of AMs built on FBANK features. This paper proposes a novel multi-span structure for acoustic modelling based on the raw waveform with multiple streams of CNN input layers, each processing a different span of the raw waveform signal. Evaluation on both the single-channel CHiME4 and AMI data sets shows that multi-span AMs give a lower WER than FBANK AMs by an average of about 5% (relative). Analysis of the trained multi-span model reveals that the CNNs can learn filters that are rather different from the log Mel filters. Furthermore, the paper shows that a widely used single-span raw waveform AM can be improved by using a smaller CNN kernel size and an increased stride.
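The citation statement for this entry describes the classifier that follows the concatenated multi-span features: a feed-forward DNN with 4 hidden layers of 512 units each and ReLU activations. A minimal NumPy sketch of such a network is given below. The input dimension, the output dimension, the weight initialisation and the softmax output layer are assumptions made for illustration; none of them is stated in the excerpt, and this is not the authors' HTK/PyHTK implementation.

```python
import numpy as np

def init_dnn(input_dim, output_dim, hidden_dim=512, n_hidden=4, seed=0):
    """Create weights for a feed-forward net with n_hidden ReLU layers.

    input_dim and output_dim are hypothetical; only the 4 x 512 hidden
    structure comes from the quoted citation statement.
    """
    rng = np.random.default_rng(seed)
    dims = [input_dim] + [hidden_dim] * n_hidden + [output_dim]
    params = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        # He-style scaling (an assumption, chosen because it suits ReLU)
        W = rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in)
        params.append((W, np.zeros(d_out)))
    return params

def forward(params, p):
    """Map a batch of feature vectors p (batch, input_dim) to class posteriors."""
    h = p
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # hidden layer with ReLU
    W, b = params[-1]
    logits = h @ W + b
    # Softmax output (assumed); subtract the row maximum for numerical stability
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)
```

Calling `forward(init_dnn(40, 10), features)` on an array of shape `(batch, 40)` returns one row of posteriors per input frame; the sizes 40 and 10 are placeholders.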


“…These five types of noise are also used in the noise-depend evaluation. For noise-independent evaluation, we use 5 different noises from different datasets: pedestrian, cafe, street noises from CHiME-4 [26] dataset and factory2, tank (m109) from NOISEX-92. These noises are all highly non-stationary, which makes speech enhancement be a challenging task.…”

Section: A Experimental Setup (mentioning)

confidence: 99%

Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks

Hui Zhang et al. 2019

2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)


In recent decades, neural-network-based methods have significantly improved the performance of speech enhancement. Most of them estimate a time-frequency (T-F) representation of the target speech, directly or indirectly, and then resynthesize the waveform from the estimated T-F representation. In this work, we propose the temporal convolutional recurrent network (TCRN), an end-to-end model that directly maps a noisy waveform to a clean waveform. The TCRN, which combines convolutional and recurrent neural networks, is able to efficiently and effectively leverage short-term and long-term information. Furthermore, we present an architecture that repeatedly downsamples and upsamples speech during forward propagation. We show that this architecture improves performance compared with existing convolutional recurrent networks. We also present several key techniques to stabilize the training process. The experimental results show that our model consistently outperforms existing speech enhancement approaches in terms of speech intelligibility and quality.


“…This estimator cannot be computed in practice when the clean data is unknown, but it provides a lower bound on the word error rate (WER) achievable via uncertainty decoding. For real CHiME-3 data, we computed the OU using "pseudo-clean" features obtained by least-squares subband filtering of the noisy signals using the signal recorded by a close-talk microphone as a reference, as described in [29].…”

Section: Oracle Uncertainty (OU) Estimator (mentioning)

confidence: 99%

An extended experimental investigation of DNN uncertainty propagation for noise robust ASR

Nathwani, Morales-Cordovilla, Sivasankaran et al. 2017

2017 Hands-Free Speech Communications and Microphone Arrays (HSCMA)

Self Cite


Automatic speech recognition (ASR) in noisy environments remains a challenging goal. Recently, the idea of estimating the uncertainty about the features obtained after speech enhancement and propagating it to dynamically adapt deep neural network (DNN) based acoustic models has raised some interest. However, the results in the literature were reported on simulated noisy datasets for a limited variety of uncertainty estimators. We found that they vary significantly in different conditions. Hence, the main contribution of this work is to assess DNN uncertainty decoding performance for different data conditions and different uncertainty estimation/propagation techniques. In addition, we propose a neural network based uncertainty estimator and compare it with other uncertainty estimators. We report detailed ASR results on the CHiME-2 and CHiME-3 datasets. We find that, on average, uncertainty propagation provides similar relative improvement on real and simulated data and that the proposed uncertainty estimator performs significantly better than the one in [1]. We also find that the improvement is consistent, but it depends on the signal-to-noise ratio (SNR) and the noise environment.
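The citation statement for this entry mentions "pseudo-clean" features obtained by least-squares subband filtering with a close-talk microphone signal as the reference. One way to read this is sketched below: in each STFT subband, a short filter across frames is fitted so that the filtered close-talk signal matches the noisy signal in the least-squares sense, and the filtered output serves as the pseudo-clean estimate. The function name, the number of taps and the direction of filtering (close-talk filtered towards the noisy channel) are assumptions for illustration; the excerpt does not give these details.

```python
import numpy as np

def subband_ls_filter(ref, noisy, taps=4):
    """Fit one least-squares filter per subband.

    ref, noisy: complex STFT arrays of shape (n_bins, n_frames), where
    ref is the close-talk reference. Returns (filters, filtered), with
    filtered being the reference passed through the fitted filters.
    """
    n_bins, n_frames = ref.shape
    filters = np.zeros((n_bins, taps), dtype=complex)
    filtered = np.zeros((n_bins, n_frames), dtype=complex)
    for f in range(n_bins):
        # Convolution matrix: column k holds the reference delayed by k frames
        X = np.zeros((n_frames, taps), dtype=complex)
        for k in range(taps):
            X[k:, k] = ref[f, :n_frames - k]
        # Solve min_h ||noisy[f] - X h||^2
        h, *_ = np.linalg.lstsq(X, noisy[f], rcond=None)
        filters[f] = h
        filtered[f] = X @ h
    return filters, filtered
```

With noise-free synthetic data generated by a known per-subband filter, the fitted filters should recover that filter up to numerical precision, which is a convenient sanity check before applying the routine to real recordings.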
