Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system

Heymann, Jahn; Drude, Lukas; Boeddeker, Christoph; Hanebrink, Patrick; Haeb‐Umbach, Reinhold

doi:10.1109/icassp.2017.7953173

Cited by 106 publications

(92 citation statements)

References 14 publications

“…generate a single-channel output, within which the filters can be either fixed or adaptive depending on the model design. The second category, which we refer to as the masking-based (MB) beamforming, estimates the FaS beamforming filters in frequency domain by estimating time-frequency (T-F) masks for the sources of interest [10][11][12][13][14][15][16][17][18][19][20][21][22][23]. The T-F masks specify the dominance of each T-F bin and are used to calculate the spatial covariance features required to obtain optimal weights for beamformers such as minimum variance distortionless response (MVDR) [24] and generalized eigenvalue (GEV) beamformer [25].…”

Section: Introduction (mentioning)

confidence: 99%
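The mask-to-beamformer step described in the statement above can be sketched as follows. This is a minimal NumPy illustration, not code from any of the cited papers: the function names, the array layout `(F, T, C)`, the diagonal loading constant, and the choice of the reference-channel MVDR formulation are all assumptions made for the example. In practice a neural network would supply the masks.

```python
import numpy as np

def masked_covariance(Y, mask, eps=1e-10):
    """Mask-weighted spatial covariance per frequency bin.

    Y:    complex STFT, shape (F, T, C) = (frequency, time, channel)
    mask: real T-F mask in [0, 1], shape (F, T)
    Returns an array of shape (F, C, C).
    """
    weighted = Y * mask[..., None]
    # Sum over time of mask * y y^H for every frequency bin.
    cov = np.einsum('ftc,ftd->fcd', weighted, Y.conj())
    return cov / (mask.sum(axis=1)[:, None, None] + eps)

def mvdr_weights(phi_s, phi_n, ref=0, eps=1e-10):
    """Reference-channel MVDR: w = (Phi_n^-1 Phi_s) u / trace(Phi_n^-1 Phi_s)."""
    C = phi_n.shape[-1]
    phi_n = phi_n + eps * np.eye(C)            # small diagonal loading (assumed)
    numerator = np.linalg.solve(phi_n, phi_s)  # Phi_n^-1 Phi_s, shape (F, C, C)
    trace = np.trace(numerator, axis1=-2, axis2=-1)[:, None]
    return numerator[..., ref] / (trace + eps)  # column `ref`, shape (F, C)

def apply_beamformer(w, Y):
    """Single-channel output w^H y for every T-F bin, shape (F, T)."""
    return np.einsum('fc,ftc->ft', w.conj(), Y)
```

Given a speech mask and a noise mask, `masked_covariance` would be called once per mask, and the two results passed to `mvdr_weights`. A GEV beamformer would replace `mvdr_weights` with a generalized eigenvalue problem on the same two covariance matrices.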

FaSNet: Low-Latency Adaptive Beamforming for Multi-Microphone Audio Processing

Luo, Han, Mesgarani, et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

105

Beamforming has been extensively investigated for multi-channel audio processing tasks. Recently, learning-based beamforming methods, sometimes called neural beamformers, have achieved significant improvements in both signal quality (e.g. signal-to-noise ratio (SNR)) and speech recognition (e.g. word error rate (WER)). Such systems are generally non-causal and require a large context for robust estimation of inter-channel features, which is impractical in applications requiring low-latency responses. In this paper, we propose filter-and-sum network (FaSNet), a time-domain, filter-based beamforming approach suitable for low-latency scenarios. FaSNet has a two-stage system design that first learns frame-level time-domain adaptive beamforming filters for a selected reference channel, and then calculates the filters for all remaining channels. The filtered outputs at all channels are summed to generate the final output. Experiments show that despite its small model size, FaSNet is able to outperform several traditional oracle beamformers with respect to scale-invariant signal-to-noise ratio (SI-SNR) in reverberant speech enhancement and separation tasks. Moreover, when trained with a frequency-domain objective function on the CHiME-3 dataset, FaSNet achieves 14.3% relative word error rate reduction (RWERR) compared with the baseline model. These results show the efficacy of FaSNet particularly in reverberant and noisy signal conditions.
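The filter-and-sum operation named in this abstract can be illustrated with a short NumPy sketch. It is a deliberate simplification and not the FaSNet model: here one fixed FIR filter per channel is applied to the whole signal, whereas the abstract describes filters estimated per frame by a network. The function name `filter_and_sum` and the array shapes are assumptions for the example.

```python
import numpy as np

def filter_and_sum(x, h):
    """Time-domain filter-and-sum beamforming with fixed filters.

    x: multi-channel signal, shape (C, N)
    h: one FIR filter per channel, shape (C, L)
    Returns the summed output, shape (N + L - 1,).
    """
    num_channels, num_samples = x.shape
    out = np.zeros(num_samples + h.shape[1] - 1)
    for c in range(num_channels):
        # Filter each channel with its own FIR filter, then accumulate.
        out += np.convolve(x[c], h[c])
    return out

# Two channels; the second filter delays its channel by one sample.
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
h = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y = filter_and_sum(x, h)  # [1, 6, 8, 6]
```

Delay-and-sum beamforming is the special case in which every filter is a shifted unit impulse, as in the usage example.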

“…For the estimation of the RTF ṽ, we used a method based on eigenvalue decomposition with noise covariance whitening [21,22], and apply it to the output of WPE dereverberation, to reduce the effect of reverberation and noise from the estimation. For estimation of noise spatial covariance matrices, we assumed that each utterance had noise-only periods of 225 ms and 75 ms, respectively, at its beginning and ending parts, for REVERB, and we used noise masks estimated by a BLSTM network [23] for CHiME3. Table 1 summarizes the WERs of the observed signals (Obs) and the enhanced signals obtained after the first estimation iteration.…”

Section: Estimation of Power Spectral Density and RTF (mentioning)

confidence: 99%
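RTF estimation by eigenvalue decomposition with noise covariance whitening, as mentioned in the statement above, can be sketched for a single frequency bin as below. This is a generic textbook-style illustration and not the cited authors' implementation: the function name, the use of a Cholesky factor for whitening, and the normalization to a reference channel are assumptions for the example.

```python
import numpy as np

def rtf_covariance_whitening(phi_y, phi_n, ref=0):
    """Estimate the relative transfer function (RTF) for one frequency bin.

    phi_y: covariance of the noisy observation, shape (C, C)
    phi_n: noise covariance (Hermitian positive definite), shape (C, C)
    Returns the RTF vector normalized so that entry `ref` equals 1.
    """
    # Whitening: with phi_n = L L^H, the noise becomes identity after L^-1.
    L = np.linalg.cholesky(phi_n)
    L_inv = np.linalg.inv(L)
    phi_w = L_inv @ phi_y @ L_inv.conj().T

    # Principal eigenvector of the whitened covariance (eigh sorts ascending).
    _, eigvecs = np.linalg.eigh(phi_w)
    principal = eigvecs[:, -1]

    # De-whiten and normalize to the reference channel.
    v = L @ principal
    return v / v[ref]
```

For a rank-one source in additive noise, `phi_y = sigma * d d^H + phi_n`, the whitened matrix is a rank-one term plus identity, so its principal eigenvector is proportional to `L^-1 d` and the function returns `d / d[ref]`.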

“…using (17), (18) and (21). Employing this in (15) we can express the convolutional beamformer coefficients as where we expressed Ḡ using (17) and (21), and q using (23).…”

Section: Appendix: Unified Versus Factorized Solution (mentioning)

confidence: 99%

Jointly Optimal Dereverberation and Beamforming

Boeddeker, Nakatani, Kinoshita, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

We previously proposed an optimal (in the maximum likelihood sense) convolutional beamformer that can perform simultaneous denoising and dereverberation, and showed its superiority over the widely used cascade of a Weighted Prediction Error (WPE) dereverberation filter and a conventional Minimum-Power Distortionless Response (MPDR) beamformer. However, it has not been fully investigated which components in the convolutional beamformer yield such superiority. To this end, this paper presents a new derivation of the convolutional beamformer that allows us to factorize it into a WPE dereverberation filter, and a special type of a (non-convolutional) beamformer, referred to as a weighted MPDR (wMPDR) beamformer, without loss of optimality. With experiments, we show that the superiority of the convolutional beamformer in fact comes from its wMPDR part.

“…When a microphone array is available, ASR performance can be greatly improved by employing multi-channel speech enhancement (SE) pre-processing with an ASR back-end trained on multi-condition training (MCT) data. For example, the combination of neural-network (NN) based time-frequency mask estimation with beamforming has been employed by all top systems in recent distant ASR challenges [3,4]. It is worth mentioning that multi-channel SE can improve ASR performance even without any retraining of the ASR back-end on the enhanced speech, which may be possible because they introduce only a few distortions to the processed signals.…”

Section: Introduction (mentioning)

confidence: 99%

Improving Noise Robust Automatic Speech Recognition with Single-Channel Time-Domain Enhancement Network

Kinoshita, Ochiai, Delcroix, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

With the advent of deep learning, research on noise-robust automatic speech recognition (ASR) has progressed rapidly. However, ASR performance in noisy conditions of single-channel systems remains unsatisfactory. Indeed, most single-channel speech enhancement (SE) methods (denoising) have brought only limited performance gains over a state-of-the-art ASR back-end trained on multi-condition training data. Recently, there has been much research on neural network-based SE methods working in the time domain, showing levels of performance never attained before. However, it has not been established whether the high enhancement performance achieved by such time-domain approaches could be translated into ASR. In this paper, we show that a single-channel time-domain denoising approach can significantly improve ASR performance, providing more than 30% relative word error reduction over a strong ASR back-end on the real evaluation data of the single-channel track of the CHiME-4 dataset. These positive results demonstrate that single-channel noise reduction can still improve ASR performance, which should open the door to more research in that direction.
