Robust speech recognition in reverberant environments by using an optimal synthetic room impulse response model

Liu, Jindong; Yang, Guang‐Zhong

doi:10.1016/j.specom.2014.11.004

Cited by 12 publications

(4 citation statements)

References 27 publications


“…Additionally, several room acoustic parameters have been applied in different dereverberation methods to suppress the reverberation in the signal. C50 is used in [9], [10] and T60 in [11], [12] to select the ASR acoustic model that better represents the reverberant conditions of the input utterance. In [13], T60 is used to add to the current hidden Markov model state the contribution of previous states by applying a piecewise energy decay curve that is separated in early reflections and late reverberation contributions.…”

Section: Introduction (mentioning)

confidence: 99%

A Single-Channel Non-Intrusive C50 Estimator Correlated With Speech Recognition Performance

Parada, Sharma, Laínez et al. 2016. IEEE/ACM Trans. Audio Speech Lang. Process.

Abstract: Several intrusive measures of reverberation can be computed from measured and simulated room impulse responses, over the full frequency band or for each individual mel-frequency subband. It is initially shown that full-band clarity index C50 is the most correlated measure on average with reverberant speech recognition performance. This corroborates previous findings but now for the dataset to be used in this study. We extend the previous findings to show that C50 also exhibits the highest mutual information on average. Motivated by these extended findings, a non-intrusive room acoustic (NIRA) estimation method is proposed to estimate C50 from only the reverberant speech signal. The NIRA method is a data-driven approach based on computing a number of features from the speech signal and it employs these features to train a model used to perform the estimation. The choice of features and learning techniques are explored in this work using an evaluation set which comprises approximately 100,000 different reverberant signals (around 93 hours of speech) including reverberation from measured and simulated room impulse responses. The feature importance of each feature with respect to the estimation of the target C50 is analysed following two different approaches. In both cases the newly chosen set of features shows high importance for the target. The best C50 estimator provides a root mean square deviation around 3 dB on average for all reverberant test environments.
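For readers unfamiliar with the clarity index named in this abstract, the following is a minimal sketch of the intrusive (impulse-response-based) definition of full-band C50: the ratio, in dB, of the energy in the first 50 ms after the direct path to the energy that arrives later. It is not the NIRA estimator described in the abstract. The function name `clarity_index_c50`, the choice of the strongest sample as the direct-path arrival, and the toy impulse response are all assumptions made for illustration.

```python
import math


def clarity_index_c50(rir, sample_rate_hz, early_window_s=0.050):
    """Return C50 in dB for a room impulse response given as a list of samples."""
    # Assumption: the strongest sample marks the direct-path arrival.
    onset = max(range(len(rir)), key=lambda n: abs(rir[n]))
    # First sample that counts as "late" (50 ms after the direct path).
    split = onset + round(early_window_s * sample_rate_hz)
    early_energy = sum(h * h for h in rir[onset:split])
    late_energy = sum(h * h for h in rir[split:])
    if late_energy == 0.0:
        raise ValueError("no energy after the early window; C50 is undefined")
    return 10.0 * math.log10(early_energy / late_energy)


# Toy impulse response (hypothetical): a unit direct path plus one
# reflection of amplitude 0.1 arriving 100 ms later.
fs = 16000
toy_rir = [0.0] * fs
toy_rir[0] = 1.0
toy_rir[int(0.100 * fs)] = 0.1
c50_db = clarity_index_c50(toy_rir, fs)
```

In the toy case the early energy is 1 and the late energy is 0.01, so the ratio is about 100, i.e. roughly 20 dB. Measured impulse responses would normally need onset detection that is more robust than picking the peak sample.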



“…These results are an important extension from our previous work in static SSL and support the robustness of the system to the sound dynamics in real-world environments. Furthermore, our system can be easily integrated with recent methods to enhance ASR in reverberant environments [55]-[57] without adding computational cost. This is the intrinsic advantage of embodied embedded cognition.…”

Section: Discussion (mentioning)

confidence: 99%

Enhanced Robot Speech Recognition Using Biomimetic Binaural Sound Source Localization

Dávila-Chacón, Liu, Wermter 2019. IEEE Trans. Neural Netw. Learning Syst.

Self Cite


Inspired by the behavior of humans talking in noisy environments, we propose an embodied embedded cognition approach to improve automatic speech recognition (ASR) systems for robots in challenging environments, such as with ego noise, using binaural sound source localization (SSL). The approach is verified by measuring the impact of SSL with a humanoid robot head on the performance of an ASR system. More specifically, a robot orients itself toward the angle where the signal-to-noise ratio (SNR) of speech is maximized for one microphone before doing an ASR task. First, a spiking neural network inspired by the midbrain auditory system based on our previous work is applied to calculate the sound signal angle. Then, a feedforward neural network is used to handle high levels of ego noise and reverberation in the signal. Finally, the sound signal is fed into an ASR system. For ASR, we use a system developed by our group and compare its performance with and without the support from SSL. We test our SSL and ASR systems on two humanoid platforms with different structural and material properties. With our approach we halve the sentence error rate with respect to the common downmixing of both channels. Surprisingly, the ASR performance is more than two times better when the angle between the humanoid head and the sound source allows sound waves to be reflected most intensely from the pinna to the ear microphone, rather than when sound waves arrive perpendicularly to the membrane.


“…The image model method, first proposed in [19], is the most widespread among the latter. Alternatively, statistical methods [20] or methods based on geometric acoustics and ray tracing [21] can be used. To create realistic sound signals in this work, the image model method was used in the implementation of Lehman, Johansson and Nordholm [22, 23].…”

Section: Methods (mentioning)

confidence: 99%

Study of Generalized Phase Spectrum Time Delay Estimation Method for Source Positioning in Small Room Acoustic Environment

Faerman, Avramchuk, Voevodin et al. 2022. Sensors

This paper considers the application of signal processing methods to passive indoor positioning with acoustics microphones. The key aspect of this problem is time-delay estimation (TDE) that is used to get the time difference of arrival of the source’s signal between the pair of distributed microphones. This paper studies the approach based on generalized phase spectrum (GPS) TDE methods. These methods use frequency-domain information about the received signals that make them different from widely applied generalized cross-correlation (GCC) methods. Despite the more challenging implementation, GPS TDE methods can be less demanding on computational resources and memory than conventional GCC ones. We propose an algorithmic implementation of a GPS estimator and study the various frequency weighting options in applications to TDE in a small room acoustic environment. The study shows that the GPS method is a reliable option for small acoustically dead rooms and could be effectively applied in presence of moderate in-band noises. However, GPS estimators are far less efficient in less acoustically dead environments, where other TDE options should be considered. The distinguishing feature of the proposed solution is the ability to get the time delay using a limited number of the adjusted bins. The solution could be useful for passively locating moving emitters of narrow-band continual noises using computationally simple frequency detection algorithms.
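As background for the time-delay estimation problem this abstract describes, here is a minimal sketch of the conventional baseline it contrasts against: picking the lag that maximizes the plain (unweighted) cross-correlation between two microphone signals, computed directly in the time domain. It is not the generalized phase spectrum estimator proposed in the paper. The function name `estimate_delay`, the synthetic noise source, and the 5-sample delay are assumptions chosen for illustration.

```python
import random


def estimate_delay(x, y, max_lag):
    """Return the lag in samples at which y best matches x (positive: y lags x)."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation at this lag: sum of x[n] * y[n + lag] over valid n.
        score = 0.0
        for n, xn in enumerate(x):
            m = n + lag
            if 0 <= m < len(y):
                score += xn * y[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical test signals: white noise reaching microphone B five
# samples later than microphone A, with no reverberation or added noise.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(400)]
true_delay = 5
mic_a = source
mic_b = [0.0] * true_delay + source[:-true_delay]
estimated = estimate_delay(mic_a, mic_b, max_lag=20)
```

Dividing the estimated lag by the sampling rate gives the time difference of arrival in seconds. In reverberant rooms the correlation peak is smeared by reflections, which is why frequency-weighted variants are usually preferred over this unweighted form.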

