Single‐channel dereverberation and denoising based on lower band trained SA‐LSTMs

Li, Yi; Sun, Yang; Naqvi, Syed Mohsen

doi:10.1049/iet-spr.2020.0134

Cited by 9 publications

(9 citation statements)

References 36 publications

(48 reference statements)


“…Finally, the proposed method denoises the speech mixture in a highly reverberant environment. Future work should be dedicated to exploit the dereverberation pretask [41], [42] to further refine the speech enhancement performance.…”

Section: Discussion (mentioning)

confidence: 99%

Self-Supervised Learning based Monaural Speech Enhancement with Complex-Cycle-Consistent

Li, Sun, Naqvi

2021

Preprint

Self Cite


Recently, self-supervised learning (SSL) techniques have been introduced to solve the monaural speech enhancement problem. Due to the lack of using clean phase information, the enhancement performance is limited in most SSL methods. Therefore, in this paper, we propose a phase-aware self-supervised learning based monaural speech enhancement method. The latent representations of both amplitude and phase are studied in two decoders of the foundation autoencoder (FAE) with only a limited set of clean speech signals independently. Then, the downstream autoencoder (DAE) learns a shared latent space between the clean speech and mixture representations with a large number of unseen mixtures. A complex-cycle-consistent (CCC) mechanism is proposed to minimize the reconstruction loss between the amplitude and phase domains. Besides, it is noticed that if the speech features are extracted as the multi-resolution spectra, the desired information distributed in spectra of different scales can be studied to further boost the performance. The NOISEX and DAPS corpora are used to generate mixtures with different interferences to evaluate the efficacy of the proposed method. It is highlighted that the clean speech and mixtures fed in FAE and DAE are not paired. Both ablation and comparison experimental results show that the proposed method clearly outperforms the state-of-the-art approaches.



“…Although most of the reverberations are removed by DM, the remaining reverberations in Ŷd still limit the performance [7]. Thus, in the second sub-layer, we exploit ERM in the second sub-layer to further improve the speech enhancement in reverberant environments, which can be defined as:…”

Section: Masking Module (mentioning)

confidence: 99%

“…Followed by our previous work [7], to further improve the speech enhancement performance, we introduce both the dereverberation mask (DM) and the estimated ratio mask (ERM) to provide the time-frequency relationships between the clean speech signal and the reverberant mixture. Hence, inspired by [8], we propose a multi pre-tasks SSL method which only needs a limited set of randomly selected clean speech signals and the corresponding mixture recordings in the pre-training.…”

Section: Introduction (mentioning)

confidence: 99%

Self-Supervised Learning based Monaural Speech Enhancement with Multi-Task Pre-Training

Li, Sun, Naqvi

2021

Preprint

Self Cite


In self-supervised learning, it is challenging to reduce the gap between the enhancement performance on the estimated and target speech signals with existed pre-tasks. In this paper, we propose a multi-task pre-training method to improve the speech enhancement performance with self-supervised learning. Within the pre-training autoencoder (PAE), only a limited set of clean speech signals are required to learn their latent representations. Meanwhile, to solve the limitation of single pre-task, the proposed masking module exploits the dereverberated mask and estimated ratio mask to denoise the mixture as the second pre-task. Different from the PAE, where the target speech signals are estimated, the downstream task autoencoder (DAE) utilizes a large number of unlabeled and unseen reverberant mixtures to generate the estimated mixtures. The trained DAE is shared by the learned representations and masks. Experimental results on a benchmark dataset demonstrate that the proposed method outperforms the state-of-the-art approaches.


“…By using short-time Fourier transform (STFT), the state-of-the-art methods estimate the spectrogram of the desired speech signal from the mixture spectrogram (Kumawat and Raman 2020) (Pandey and Wang 2020). However, it has been confirmed that the background noise is uniformly distributed at the full band and human speech occupies in the lower frequency-band (Li, Sun, and Naqvi 2021). Thus, the whole T-F attention map is further divided into three sub attention maps, time attention (TA), high frequency-band attention (HFA), and low frequency-band attention (LFA).…”

Section: Introduction (mentioning)

confidence: 99%

U-shaped Transformer with Frequency-Band Aware Attention for Speech Enhancement

Li, Sun, Naqvi

2021

Preprint

Self Cite


The state-of-the-art speech enhancement has limited performance in speech estimation accuracy. Recently, in deep learning, the Transformer shows the potential to exploit the long-range dependency in speech by self-attention. Therefore, it is introduced in speech enhancement to improve the speech estimation accuracy from a noise mixture. However, to address the computational cost issue in Transformer with self-attention, the axial attention is the option i.e., to split a 2D attention into two 1D attentions. Inspired by the axial attention, in the proposed method we calculate the attention map along both time- and frequency-axis to generate time and frequency sub-attention maps. Moreover, different from the axial attention, the proposed method provides two parallel multi-head attentions for time- and frequency-axis. Furthermore, it is proven in the literature that the lower frequency-band in speech, generally, contains more desired information than the higher frequency-band, in a noise mixture. Therefore, the frequency-band aware attention is proposed i.e., high frequency-band attention (HFA), and low frequency-band attention (LFA). The U-shaped Transformer is also first time introduced in the proposed method to further improve the speech estimation accuracy. The extensive evaluations over four public datasets, confirm the efficacy of the proposed method.
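The idea of splitting one 2D attention into two 1D attentions can be illustrated with a small NumPy sketch. It is not the cited paper's implementation: it uses a single head with no learned projections (queries, keys and values are all the input itself), it sums the two parallel branches, and the names `attention_1d`, `parallel_axial_attention`, `band_attention` and the `split_bin` cutoff are hypothetical choices made only for this example. For a feature map with T frames and F frequency bins, full 2D attention needs (T·F)² scores, while the two 1D attentions together need T·F·(T+F).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_1d(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., N, D). Assumption: Q = K = V = x (no learned weights).
    Returns the attended features (..., N, D) and the weights (..., N, N).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ x, weights


def parallel_axial_attention(feat):
    """Time and frequency attention computed in parallel on feat (T, F, D)."""
    # Frequency branch: each time frame attends across its F bins.
    freq_out, freq_w = attention_1d(feat)            # (T, F, D), (T, F, F)
    # Time branch: each frequency bin attends across the T frames.
    time_out, time_w = attention_1d(np.swapaxes(feat, 0, 1))  # (F, T, D), (F, T, T)
    time_out = np.swapaxes(time_out, 0, 1)           # back to (T, F, D)
    # Assumption: the two branches are merged by a plain sum.
    return time_out + freq_out, time_w, freq_w


def band_attention(feat, split_bin):
    """Separate frequency attention for the low and high bands.

    split_bin is a hypothetical cutoff index between the two bands.
    """
    low, _ = attention_1d(feat[:, :split_bin])
    high, _ = attention_1d(feat[:, split_bin:])
    return np.concatenate([low, high], axis=1)


rng = np.random.default_rng(0)
feat = rng.standard_normal((6, 8, 4))   # T=6 frames, F=8 bins, D=4 channels
out, time_w, freq_w = parallel_axial_attention(feat)
banded = band_attention(feat, split_bin=3)
```

Here `time_w` holds one T×T map per frequency bin and `freq_w` one F×F map per frame, so neither branch ever forms the full (T·F)×(T·F) map.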

