This work evaluates multi-microphone beamforming techniques and single-microphone spectral enhancement strategies for reducing the effect of reverberation in robust automatic speech recognition (ASR) systems. The evaluation covers reverberant environments characterized by different reverberation times (T60) and direct-to-reverberation ratios (DRRs). The systems under test consist of minimum variance distortionless response (MVDR) beamformers in combination with minimum mean square error (MMSE) estimators. For the latter, reliable late reverberation spectral variance (LRSV) estimation employing a generalized model of the room impulse response (RIR) is crucial. Based on the generalized RIR model, which separates the direct path from the remaining RIR, two different frequency resolutions in the short-time Fourier transform (STFT) domain, referred to as short- and long-term, are evaluated to effectively estimate the direct signal. Regarding the fusion of the MVDR beamformer and the MMSE estimator, the LRSV estimator can operate either on the multi-channel observed speech signals or on the single-channel beamformer output. Together, these options yield four different combined system architectures, which are evaluated and analyzed with a focus on optimal ASR performance in terms of word error rate (WER).
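For readers unfamiliar with the beamformer named above, the textbook form of the MVDR filter in the STFT domain can be written as follows. The notation is an assumption made for illustration and is not taken from the paper: $\mathbf{y}(k,n)$ is the stacked vector of microphone STFT coefficients at frequency bin $k$ and frame $n$, $\mathbf{d}(k)$ is the steering vector toward the target speaker, and $\boldsymbol{\Phi}(k)$ is the spatial covariance matrix of the undesired signal components. How the paper estimates these quantities is not stated in the abstract.

```latex
% Textbook MVDR: minimize output power subject to a distortionless constraint.
% Symbols y, d, Phi, w are illustrative assumptions, not the paper's notation.
\begin{align}
  \mathbf{w}_{\mathrm{MVDR}}(k)
    &= \arg\min_{\mathbf{w}} \; \mathbf{w}^{H} \boldsymbol{\Phi}(k)\, \mathbf{w}
       \quad \text{subject to} \quad \mathbf{w}^{H} \mathbf{d}(k) = 1 \\
    &= \frac{\boldsymbol{\Phi}^{-1}(k)\, \mathbf{d}(k)}
            {\mathbf{d}^{H}(k)\, \boldsymbol{\Phi}^{-1}(k)\, \mathbf{d}(k)}, \\
  z(k,n) &= \mathbf{w}_{\mathrm{MVDR}}^{H}(k)\, \mathbf{y}(k,n).
\end{align}
```

In this sketch, $z(k,n)$ is the single-channel beamformer output. The two placements of the LRSV estimator described in the abstract correspond to estimating the late reverberant variance either from the multi-channel observation $\mathbf{y}(k,n)$ or from $z(k,n)$.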