Making Machines Understand Us in Reverberant Rooms: Robustness Against Reverberation for Automatic Speech Recognition

Yoshioka, Takuya; Sehr, Armin; Delcroix, Marc; Kinoshita, Keisuke; Maas, Roland; Nakatani, Tomohiro; Kellermann, Walter

doi:10.1109/msp.2012.2205029

Cited by 206 publications

(126 citation statements)

References 38 publications

Supporting

Mentioning

125

Contrasting


“…The late reverberation part of the room impulse response is often modeled as an exponentially damped Gaussian noise process and treated as additive noise. Hence, the observed reverberant signal x(t) can be written by using the notation in [1] as…”

Section: Speech Enhancement Using DNN (mentioning)

confidence: 99%
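The statistical model named in the statement above (late reverberation as exponentially damped Gaussian noise) can be illustrated with a short sketch. This is not the equation elided in the quotation. It is a minimal illustration under stated assumptions: the function names, the 16 kHz sampling rate, and the 50 ms boundary between early and late reverberation are all hypothetical choices. The decay constant is set so that the amplitude envelope drops by 60 dB after the reverberation time T60.

```python
import numpy as np


def late_reverb_tail(t60_s, fs_hz=16000, length_s=None, seed=0):
    """Return (tail, envelope): Gaussian noise shaped by an exponential decay.

    The envelope exp(-decay * t) falls by 60 dB at t = t60_s.
    """
    if length_s is None:
        length_s = 1.2 * t60_s  # assumed length, slightly beyond T60
    n = int(round(length_s * fs_hz))
    t = np.arange(n) / fs_hz
    decay = 3.0 * np.log(10.0) / t60_s  # exp(-decay * t60_s) == 1e-3
    envelope = np.exp(-decay * t)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n) * envelope, envelope


def add_late_reverb(clean, tail, fs_hz=16000, late_start_s=0.05):
    """Add late reverberation to a clean signal as an additive term.

    The tail is delayed by late_start_s (an assumed early/late boundary),
    convolved with the clean signal, and summed with it.
    """
    delay = np.zeros(int(round(late_start_s * fs_hz)))
    late = np.convolve(clean, np.concatenate([delay, tail]))[: len(clean)]
    return clean + late
```

For example, `late_reverb_tail(0.5)` produces a tail for a room with T60 = 0.5 s, and `add_late_reverb(clean, tail)` returns a signal of the same length as `clean` with the late part added.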

“…Automatic speech recognition from distant microphones is a challenging task, because the speech signals to be recognized are degraded by the presence of interfering signals and reverberation due to large speaker-to-microphone distance [1]. The conventional multichannel enhancement techniques, such as beamforming, are widely employed to suppress noise and reverberation from the desired speech when multiple microphones (e.g., microphone arrays) are used to capture audio signals [2,3].…”

Section: Introduction (mentioning)

confidence: 99%


Feature mapping using far-field microphones for distant speech recognition

Himawan, Motlíček, Sridharan (2016). Speech Communication.


Acoustic modeling based on deep architectures has recently gained remarkable success, with substantial improvement of speech recognition accuracy in several automatic speech recognition (ASR) tasks. For distant speech recognition, the multi-channel deep neural network based approaches rely on the powerful modeling capability of a deep neural network (DNN) to learn a suitable representation of distant speech directly from its multi-channel source. In this model-based combination of multiple microphones, features from each channel are concatenated and used together as an input to the DNN. This allows integrating the multi-channel audio for acoustic modeling without any pre-processing steps. Despite the powerful modeling capabilities of DNNs, an environmental mismatch due to noise and reverberation may result in severe performance degradation when features are simply fed to a DNN without a feature enhancement step. In this paper, we introduce the nonlinear bottleneck feature mapping approach using a DNN to transform the noisy and reverberant features to their clean versions. The bottleneck features trained on clean signals are used as a teacher signal because they contain information relevant to phoneme classification, and the mapping is performed with the objective of suppressing noise and reverberation. The individual and combined impacts of beamforming and speaker adaptation techniques along with the feature mapping are examined for distant large vocabulary speech recognition, using single and multiple far-field microphones. As an alternative to beamforming, experiments with concatenating multiple channel features are conducted. The experimental results on the AMI meeting corpus show that the feature mapping, used in combination with beamforming and speaker adaptation, yields a distant speech recognition performance below 50% word error rate (WER), using a DNN for acoustic modeling.


“…We employ the STFT-domain dereverberation algorithm that was first proposed in [12] for a two-microphone one-output case and generalized later in [14]. The single-channel version is briefly described in [11] and in the following.…”

Section: Front Ends (mentioning)

confidence: 99%

“…In particular, we employ a single distant microphone (SDM) setup, where only speech data from a single table-top microphone are available. As a result of the large distance between the microphone and the speakers, speech signals are contaminated by reverberation, thus making transcription very challenging [11]. To combat the reverberant distortion, we employ one exemplary dereverberation method proposed in [12] and experimentally investigate how it can affect the performance of DNN-based acoustic models for both speaker independent (SI) and speaker adaptive training (SAT) scenarios.…”

Section: Introduction (mentioning)

confidence: 99%

Impact of single-microphone dereverberation on DNN-based meeting transcription systems

Yoshioka, Xie, Gales (2014). 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Self Cite


Over the past few decades, a range of front-end techniques have been proposed to improve the robustness of automatic speech recognition systems against environmental distortion. While these techniques are effective for small tasks consisting of carefully designed data sets, especially when used with a classical acoustic model, there has been limited evidence that they are useful for a state-of-the-art system with large scale realistic data. This paper focuses on reverberation as a type of distortion and investigates the degree to which dereverberation processing can improve the performance of various forms of acoustic models based on deep neural networks (DNNs) in a challenging meeting transcription task using a single distant microphone. Experimental results show that dereverberation improves the recognition performance regardless of the acoustic model structure and the type of the feature vectors input into the neural networks, providing additional relative improvements of 4.7% and 4.1% to our best configured speaker-independent and speaker-adaptive DNN-based systems, respectively.


“…The performance of existing models trained with anechoic speech signals can deteriorate when the person talking to the robot is located a few metres away [6]. Thus far, many algorithms for ASR in reverberant rooms have been developed with a focus mainly on spectrum enhancement, feature enhancement, hidden Markov model (HMM) adaptation and reverberant modeling during speech recognition [7]. Existing research uses multiple channel input [8][9] to deal with background noise or simultaneous speech.…”

Section: Introduction (mentioning)

confidence: 99%

Robust speech recognition in reverberant environments by using an optimal synthetic room impulse response model

Liu, Yang (2015). Speech Communication.


This paper presents a practical technique for automatic speech recognition (ASR) in multiple reverberant environments based on multi-model selection. Multiple ASR models are trained with artificial synthetic room impulse responses (IRs), i.e. simulated room IRs, with different reverberation times ($T_{60}^{\mathrm{Model}}$s) and tested on real room IRs with varying $T_{60}^{\mathrm{Room}}$s. To apply our method, the biggest challenge is to choose a proper artificial room IR model for training ASR models. In this paper, a generalised statistical IR model with attenuated reverberation after an early reflection period, named the attenuated IR model, has been adopted based on three time-domain statistical IR models. Its optimal values of the reverberation-attenuation factor and the early reflection period with respect to the recognition rate have been searched and determined. Extensive testing has been performed over four real room IR sets (63 IRs in total) with varying $T_{60}^{\mathrm{Room}}$s and speaker-microphone distances (SMDs). The optimised attenuated IR model had the best performance in terms of recognition rate over others. Specific considerations of the practical use of the method have been taken into account including: i) the maximal training step of $T_{60}^{\mathrm{Model}}$ in order to get the minimal number of models with acceptable performance; ii) the impact of selection errors on the ASR caused by the estimation error of $T_{60}^{\mathrm{Room}}$; and iii) the performance over SMD and direct-to-reverberation energy ratio (DRR). It is shown that recognition rates of over
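The multi-model selection idea in this abstract can be sketched as follows. The abstract does not give the selection rule, so picking the model whose training T60 is nearest to the estimated room T60 is an assumption made here for illustration. The function names and the grid values are hypothetical as well.

```python
def model_grid(t60_min_s, t60_max_s, step_s):
    """List the T60 values (seconds) at which models are assumed to be trained."""
    n = int(round((t60_max_s - t60_min_s) / step_s)) + 1
    # Rounding removes floating-point drift from the repeated additions.
    return [round(t60_min_s + i * step_s, 6) for i in range(n)]


def select_model(t60_estimate_s, model_t60s_s):
    """Return the model T60 closest to the estimated room T60.

    Nearest-neighbour selection is an assumed rule, not taken from the paper.
    """
    if not model_t60s_s:
        raise ValueError("at least one trained model is required")
    return min(model_t60s_s, key=lambda t60: abs(t60 - t60_estimate_s))


if __name__ == "__main__":
    grid = model_grid(0.2, 1.0, 0.2)
    print(grid, select_model(0.55, grid))
```

A coarser grid step means fewer models to train, but an estimated room T60 can then lie further from the nearest model, which is the trade-off behind consideration i) in the abstract.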
