Unsupervised Neural Mask Estimator for Generalized Eigen-Value Beamforming Based ASR

Kumar, Rohit; Sreeram, Anirudh; Purushothaman, Anurenjan; Ganapathy, Sriram

doi:10.1109/icassp40776.2020.9054550

“…The experiments are performed on REVERB challenge (Kinoshita et al, 2013) and CHiME-3 (Barker et al, 2015) datasets. For the baseline model, we use WPE enhancement (Nakatani et al, 2010) along with unsupervised GEV beamforming (Kumar et al, 2020). This signal is processed with filter-bank energy features (denoted as BF-FBANK).…”

Section: Experiments and Results

confidence: 99%

“…One common approach to suppress reverberation is to combine all channels by beamforming (Anguera et al, 2007) before feeding it to the ASR system. Recently, unsupervised neural mask estimator for generalized eigen-value beamforming is proposed (Kumar et al, 2020). Traditional pre-processing also includes the weighted prediction error (WPE) (Nakatani et al, 2010) based dereverberation along with the beamforming in most state-of-art far-field ASR systems.…”

Section: Introduction

confidence: 99%

Dereverberation of autoregressive envelopes for far-field speech recognition

Purushothaman, Sreeram, Kumar et al. 2022

Computer Speech & Language

Self Cite


The task of speech recognition in far-field environments is adversely affected by the reverberant artifacts that manifest as the temporal smearing of the sub-band envelopes. In this paper, we develop a neural model for speech dereverberation using the long-term sub-band envelopes of speech. The sub-band envelopes are derived using frequency domain linear prediction (FDLP) which performs an autoregressive estimation of the Hilbert envelopes. The neural dereverberation model estimates the envelope gain which when applied to reverberant signals suppresses the late reflection components in the far-field signal. The dereverberated envelopes are used for feature extraction in speech recognition. Further, the sequence of steps involved in envelope dereverberation, feature extraction and acoustic modeling for ASR can be implemented as a single neural processing pipeline which allows the joint learning of the dereverberation network and the acoustic model. Several experiments are performed on the REVERB challenge dataset, CHiME-3 dataset and VOiCES dataset. In these experiments, the joint learning of envelope dereverberation and acoustic model yields significant performance improvements over the baseline ASR system based on log-mel spectrogram as well as other past approaches for dereverberation (average relative improvements of 10-24% over the baseline system). A detailed analysis of the choice of hyper-parameters and the cost function involved in envelope dereverberation is also provided.
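As an illustration of the FDLP step the abstract mentions, the following is a minimal sketch, not the authors' implementation. It assumes the common textbook formulation: take a DCT of the time signal, fit an all-pole (autoregressive) model to the DCT sequence with the autocorrelation method, and read the model's power response as a smooth temporal envelope. The function name `fdlp_envelope`, the model order, and the diagonal loading constant are hypothetical choices made for this example.

```python
import numpy as np

def fdlp_envelope(x, order=20):
    """Approximate the temporal envelope of x by an AR model fit to its DCT.

    Assumption: plain DCT-II and autocorrelation-method linear prediction;
    sub-band windowing and other refinements are omitted.
    """
    n = len(x)
    # DCT-II built explicitly so the sketch needs only NumPy (O(n^2) cost).
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    c = np.cos(np.pi * (t + 0.5) * k / n) @ x
    # Autocorrelation of the DCT sequence for lags 0..order.
    r = np.array([c[: n - lag] @ c[lag:] for lag in range(order + 1)])
    # Yule-Walker normal equations R a = -r[1:], with light diagonal loading.
    idx = np.abs(np.arange(order)[:, None] - np.arange(order)[None, :])
    R = r[idx] + 1e-9 * r[0] * np.eye(order)
    a = np.concatenate(([1.0], np.linalg.solve(R, -r[1:])))
    gain = r[0] + a[1:] @ r[1:]  # prediction error power
    # Evaluate gain / |A|^2 at n points on [0, pi): because the model was fit
    # in the DCT domain, this "spectrum" is indexed by time.
    A = np.fft.rfft(a, 2 * n)[:n]
    return gain / np.abs(A) ** 2
```

For a signal whose energy is concentrated in a short burst, the returned envelope should be larger around the burst than in the quiet region. A higher `order` follows the envelope more closely, while a lower one smooths it.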


“…The experiments are performed on REVERB challenge [13] and CHiME-3 [14] datasets. For the baseline model, we use WPE enhancement [8] along with unsupervised GEV beamforming [7]. This signal is processed with filterbank energy features (denoted as BF-FBANK).…”

Section: Experiments and Results

confidence: 99%

“…One common approach to suppress reverberation is to combine all channels by beamforming [6] before feeding it to the ASR system. Recently, unsupervised neural mask estimator for generalized eigen-value beamforming is proposed [7].…”

Section: Introduction

confidence: 99%

Dereverberation of Autoregressive Envelopes for Far-field Speech Recognition

Purushothaman, Sreeram, Kumar et al. 2021

Preprint

Self Cite




“…A common approach in multi-channel recording conditions is to use a weighted and delayed combination of the multiple channels using the technique called beamforming [4]. The current state-of-art approaches to beamforming use a neural mask estimator [5,6]. The speech and noise mask estimations are used to derive the power spectral density of the source and interfering signals for eigen value based beamforming [7].…”

Section: Introduction

confidence: 99%
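The mask-based eigenvalue beamforming described in the statement above can be sketched as follows. This is a hedged illustration under stated assumptions, not the cited papers' code: the multi-channel STFT `Y` has shape (frequency, time, channel), the speech and noise masks are already available (the neural estimator itself is not shown), and the beamformer is the principal generalized eigenvector of the speech and noise power spectral density (PSD) matrices. The function names and the regularization constant `eps` are hypothetical.

```python
import numpy as np

def masked_psd(Y, mask):
    """Mask-weighted spatial PSD. Y: (F, T, C) complex STFT, mask: (F, T)."""
    weighted = mask[..., None] * Y
    psd = np.einsum("ftc,ftd->fcd", weighted, Y.conj())
    return psd / np.maximum(mask.sum(axis=1), 1e-10)[:, None, None]

def gev_weights(psd_speech, psd_noise, eps=1e-6):
    """Per-frequency weights maximizing (w^H Phi_s w) / (w^H Phi_n w)."""
    F, C, _ = psd_speech.shape
    W = np.empty((F, C), dtype=complex)
    for f in range(F):
        # Diagonal loading keeps the noise PSD positive definite (assumption).
        load = eps * np.trace(psd_noise[f]).real / C
        L = np.linalg.cholesky(psd_noise[f] + load * np.eye(C))
        Linv = np.linalg.inv(L)
        # Whitening turns the generalized problem into a standard Hermitian one.
        A = Linv @ psd_speech[f] @ Linv.conj().T
        A = 0.5 * (A + A.conj().T)
        _, vecs = np.linalg.eigh(A)  # eigenvalues in ascending order
        W[f] = Linv.conj().T @ vecs[:, -1]
    return W

def apply_beamformer(W, Y):
    """Single-channel output w^H y for every time-frequency bin."""
    return np.einsum("fc,ftc->ft", W.conj(), Y)
```

A typical call sequence would be `masked_psd(Y, speech_mask)` and `masked_psd(Y, noise_mask)`, followed by `gev_weights` and `apply_beamformer`. The eigenvector scale is arbitrary, so practical systems usually add a normalization step afterwards; that step is left out here.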

End-to-End Speech Recognition With Joint Dereverberation Of Sub-Band Autoregressive Envelopes

Kumar, Purushothaman, Sreeram et al. 2021

Preprint

Self Cite


End-to-end (E2E) automatic speech recognition (ASR) offers several advantages over previous efforts for recognizing speech. However, in reverberant conditions, E2E ASR is a challenging task as the long-term sub-band envelopes of the reverberant speech are temporally smeared. In this paper, we develop a feature enhancement approach using a neural model operating on sub-band temporal envelopes. The temporal envelopes are modeled using the framework of frequency domain linear prediction (FDLP). The neural enhancement model proposed in this paper performs an envelope gain based enhancement of temporal envelopes. The model architecture consists of a combination of convolutional and long short-term memory (LSTM) neural network layers. Further, the envelope dereverberation, feature extraction and acoustic modeling using transformer-based E2E ASR can all be jointly optimized for the speech recognition task. The joint optimization ensures that the dereverberation model targets the ASR cost function. We perform E2E speech recognition experiments on the REVERB challenge dataset as well as on the VOiCES dataset. In these experiments, the proposed joint modeling approach yields significant improvements compared to the baseline E2E ASR system (average relative improvements of 21% on the REVERB challenge dataset and about 10% on the VOiCES dataset).


Unsupervised Neural Mask Estimator for Generalized Eigen-Value Beamforming Based ASR

Cited by 7 publications

References 15 publications
