“…Additionally, advances in technologies such as hearing aids require speech systems to enhance the perceptual quality of speech captured in adverse environmental conditions, thereby improving human hearing. Several deep learning (DL)-based speech enhancement systems have been developed that jointly improve perceptual quality and the performance of back-end speech and language applications, using fully convolutional networks (FCNs) and recurrent neural networks (RNNs) [9,10,11,12]. Most of these approaches operate on the complex short-time Fourier transform (STFT) of the distorted speech, either enhancing the log-power spectrum (LPS) and reusing the unaltered distorted phase [13,14,15,16,17], or estimating a complex ratio mask (cRM) [18,19,20] to enhance the complex spectrogram directly and restore a cleaner time-domain signal.…”
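The two enhancement strategies described above can be sketched numerically. The snippet below is a minimal illustration, not any of the cited systems: the "enhanced" LPS and the predicted cRM are stand-in values (a real system would produce them with a trained network). It shows that LPS enhancement changes only the magnitude while reusing the noisy phase, whereas a complex ratio mask modifies magnitude and phase together via a complex multiply.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "noisy" complex STFT: one frame with 5 frequency bins (assumed values).
S_noisy = rng.standard_normal(5) + 1j * rng.standard_normal(5)

# --- Approach 1: enhance the log-power spectrum, reuse the noisy phase ---
# A hypothetical enhancer would predict a cleaner LPS; here we simply
# attenuate every bin's power by a factor of 2 to stand in for its output.
lps_noisy = np.log(np.abs(S_noisy) ** 2 + 1e-12)
lps_enhanced = lps_noisy - np.log(2.0)          # stand-in "denoised" LPS
mag_enhanced = np.sqrt(np.exp(lps_enhanced))
phase_noisy = np.angle(S_noisy)
S_hat_lps = mag_enhanced * np.exp(1j * phase_noisy)

# The distorted phase is carried over untouched: only the magnitude changed.
assert np.allclose(np.angle(S_hat_lps), phase_noisy)

# --- Approach 2: complex ratio mask (cRM) ---
# A hypothetical network would predict a complex mask M per bin; applying it
# rescales the magnitude AND rotates the phase: S_hat = M * S_noisy.
M = 0.8 + 0.1j                                  # stand-in predicted mask
S_hat_crm = M * S_noisy
```

Inverting either estimated spectrogram with an inverse STFT would then yield the time-domain enhanced signal; the cRM route avoids the well-known mismatch of pairing a cleaned magnitude with a noisy phase.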