Impact of single-microphone dereverberation on DNN-based meeting transcription systems

Yoshioka, Takuya; Xie, Changsheng; Gales, Mark J. F.

doi:10.1109/icassp.2014.6854660

Cited by 24 publications

(13 citation statements)

References 21 publications

(21 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Approaches to noise-robust speech recognition can generally be classified into two classes: front-end based and back-end based [1]. The front-end based approaches aim at removing distortions from the observations prior to recognition, and can either take place in time domain, spectral domain, or directly from the corrupted feature vectors [2,3]. The back-end approaches This work was supported by Samsung Electronics Co. Ltd, South Korea, under the project "Acoustic Model Adaptation toward Spontaneous Speech and Environment".…”

Section: Introductionmentioning

confidence: 99%

Learning feature mapping using deep neural network bottleneck features for distant large vocabulary speech recognition

Himawan

Motlíček

Potard

et al. 2015

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

Automatic speech recognition from distant microphones is a difficult task because recordings are affected by reverberation and background noise. First, the application of the deep neural network (DNN)/hidden Markov model (HMM) hybrid acoustic models for distant speech recognition task using AMI meeting corpus is investigated. This paper then proposes a feature transformation for removing reverberation and background noise artefacts from bottleneck features using DNN trained to learn the mapping between distant-talking speech features and close-talking speech bottleneck features. Experimental results on AMI meeting corpus reveal that the mismatch between close-talking and distant-talking conditions is largely reduced, with about 16% relative improvement over conventional bottleneck system (trained on close-talking speech). If the feature mapping is applied to close-talking speech, a minor degradation of 4% relative is observed.

show abstract

Section: Introductionmentioning

confidence: 99%

Learning feature mapping using deep neural network bottleneck features for distant large vocabulary speech recognition

Himawan

Motlíček

Potard

et al. 2015

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

show abstract

“…We apply de-reverberation based on the Weighted Prediction Error (WPE) algorithm [14,15] as front-end processing. This method is based on robust blind deconvolution using longterm linear prediction, with the motive of reducing the effects of the late reverberation.…”

Section: Wpe De-reverberationmentioning

confidence: 99%

“…As obtaining the actual noisy data is costly, the training data is artificially corrupted with reverberation and noise of different profiles. On the other hand, speech enhancement methods are used to reduce the interference in the speech signal either by de-reverberation [14,15,16] or noise reduction [17,13]. Moreover, the speech features can be engineered to alleviate the sensitivity to the recording environment [18,19,20], typically replacing the traditional nonlinearity in the mel scale with another power-law non-linearity, e.g.…”

Section: Introductionmentioning

confidence: 99%

The I2R’s ASR System for the VOiCES from a Distance Challenge 2019

Chong¹,

Tan²,

Teh³

et al. 2019

Interspeech 2019

View full text Add to dashboard Cite

This paper describes the development of the automatic speech recognition (ASR) system for the submission to the VOiCES from a Distance Challenge 2019. In this challenge, we focused on the fixed condition, where the task is to recognize reverberant and noisy speech based on a limited amount of clean training data. In our system, the mismatch between the training and testing conditions was reduced by using multi-style training where the training data was artificially contaminated with different reverberation and noise sources. Also, the Weighted Prediction Error (WPE) algorithm was used to reduce the reverberant effect in the evaluation data. To boost the system performance, acoustic models of different neural network architectures were trained and the respective systems were fused to give the final output. Moreover, an LSTM language model was used to rescore the lattice to compensate the weak n-gram model trained from only the transcription text. Evaluated on the development set, our system showed an average word error rate (WER) of 27.04%.

show abstract

“…These approaches however can not be directly applied to DNNs because of the different structure of modeling parameters. Nevertheless, there have been some investigations of using feature-domain transform-based approaches such as feature-space MLLR (fMLLR) applied to DNNs [12,13,14]. Apart from speaker variabilities, variations in the audio recording process such as reverberations, speaker-to-microphone distance (e.g., close-talk or far-field), or recording devices can lead to significant differences in acoustic patterns.…”

Section: Introductionmentioning

confidence: 99%

Towards utterance-based neural network adaptation in acoustic modeling

Himawan

Motlíček

Font

et al. 2015

2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)

View full text Add to dashboard Cite

Despite the superior classification ability of deep neural networks (DNN), the performance of DNN suffers when there is a mismatch between training and testing conditions. Many speaker adaptation techniques have been proposed for DNN acoustic modeling but in case of environmental robustness the progress is still limited. It is also possible to use techniques developed for adapting speakers to handle the impact of environments at the same time, or to combine both approaches. Directly adapting the large number of DNN parameters is challenging when the adaptation set is small. The learning hidden unit contributions (LHUC) technique for unsupervised speaker adaptation of DNN introduces speaker dependent parameters to the existing speaker independent network to increase the automatic speech recognition (ASR) performance of the target speaker using small amounts of adaptation data. This paper investigates the LHUC to adapt the speech recognizer to target speakers and environments where the impacts of speakers and noise differences are quantified separately. Our finding shows that the LHUC is capable of adapting to both speaker and noise conditions at the same time. Compared to the speaker independent model, about 9% to 13% relative word error rate (WER) improvement are observed for all test conditions using AMI meeting corpus.

show abstract

Impact of single-microphone dereverberation on DNN-based meeting transcription systems

Cited by 24 publications

References 21 publications

Learning feature mapping using deep neural network bottleneck features for distant large vocabulary speech recognition

Learning feature mapping using deep neural network bottleneck features for distant large vocabulary speech recognition

The I2R’s ASR System for the VOiCES from a Distance Challenge 2019

Towards utterance-based neural network adaptation in acoustic modeling

Contact Info

Product

Resources

About