Time-Frequency Masking in the Complex Domain for Speech Dereverberation and Denoising

Williamson, Donald S.; Wang, DeLiang

doi:10.1109/taslp.2017.2696307

Cited by 197 publications

(112 citation statements)

References 39 publications

Supporting

Mentioning

102

Contrasting


“…The loss function used for the proposed method, Eqs. (5) and (6), was also used for the conventional method. DNN in the proposed and conventional methods were trained 300 epochs where each epoch contained 2893 utterances which were randomly selected from the train set, and mini-batch size was 1.…”

Section: DNN Architecture Loss Function and Training Setup. Citation type: mentioning

confidence: 99%

Invertible DNN-Based Nonlinear Time-Frequency Transform for Speech Enhancement

Takeuchi, Yatabe, Oikawa, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


We propose an end-to-end speech enhancement method with trainable time-frequency (T-F) transform based on invertible deep neural network (DNN). The recent development of speech enhancement is brought by using DNN. The ordinary DNN-based speech enhancement employs T-F transform, typically the short-time Fourier transform (STFT), and estimates a T-F mask using DNN. On the other hand, some methods have considered end-to-end networks which directly estimate the enhanced signals without T-F transform. While end-to-end methods have shown promising results, they are black boxes and hard to understand. Therefore, some end-to-end methods used a DNN to learn the linear T-F transform which is much easier to understand. However, the learned transform may not have a

“…They are chosen in this way so that the scale of all the terms is almost the same. The regularization term for the generator is cosine similarity loss instead of L1 as widely used in other GAN methods [4,25]. We add a Gaussian noise with mean 0.0 and variance 0.01 between the encoder and the decoder of the generator.…”

Section: Generative Model (GAN). Citation type: mentioning

confidence: 99%

Coarse-to-Fine Optimization for Speech Enhancement

Yao, Al-Dahle 2019

Interspeech 2019


In this paper, we propose the coarse-to-fine optimization for the task of speech enhancement. Cosine similarity loss [1] has proven to be an effective metric to measure similarity of speech signals. However, due to the large variance of the enhanced speech with even the same cosine similarity loss in high dimensional space, a deep neural network learnt with this loss might not be able to predict enhanced speech with good quality. Our coarse-to-fine strategy optimizes the cosine similarity loss for different granularities so that more constraints are added to the prediction from high dimension to relatively low dimension. In this way, the enhanced speech will better resemble the clean speech. Experimental results show the effectiveness of our proposed coarse-to-fine optimization in both discriminative models and generative models. Moreover, we apply the coarse-to-fine strategy to the adversarial loss in generative adversarial network (GAN) and propose dynamic perceptual loss, which dynamically computes the adversarial loss from coarse resolution to fine resolution. Dynamic perceptual loss further improves the accuracy and achieves state-of-the-art results compared with other generative models.
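To make the idea of a cosine similarity loss evaluated at several granularities concrete, here is a minimal NumPy sketch. It is one possible reading of the abstract, not the authors' implementation: the function names, the equal-length chunking, and the `n_chunks` schedule are all assumptions introduced for illustration.

```python
import numpy as np

def cosine_loss(est, ref, eps=1e-8):
    """Negative cosine similarity between two 1-D signals (lower is better)."""
    num = np.dot(est, ref)
    den = np.linalg.norm(est) * np.linalg.norm(ref) + eps
    return -num / den

def coarse_to_fine_loss(est, ref, n_chunks=(1, 2, 4)):
    """Average the cosine loss over progressively finer equal-length chunks.

    n_chunks is a hypothetical schedule: 1 treats the whole utterance as a
    single vector, larger values split it into shorter segments so that each
    segment is constrained separately.
    """
    total = 0.0
    for k in n_chunks:
        seg_len = len(ref) // k
        losses = [
            cosine_loss(est[i * seg_len:(i + 1) * seg_len],
                        ref[i * seg_len:(i + 1) * seg_len])
            for i in range(k)
        ]
        total += np.mean(losses)
    return total / len(n_chunks)

# Illustrative data only: white noise stands in for clean speech.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.5 * rng.standard_normal(16000)
print(coarse_to_fine_loss(clean, clean))  # close to -1 for identical signals
print(coarse_to_fine_loss(noisy, clean))  # higher (worse) than the line above
```

Splitting into shorter segments adds per-segment constraints, which is the intuition the abstract gives for moving from high to lower dimension; how the original work weights or schedules the granularities is not stated here.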


“…This approach combines the flexibility of unsupervised NMF-based speech enhancement requiring no prior knowledge of differences between speech and noise characteristics, with online operation allowing for real-time use. RT-GCC-NMF generalizes to unseen speakers, acoustic environments, and recording setups from very little unlabeled training data: on the order of one thousand 64 ms frames, compared to hours of labeled training data required for deep learning approaches [3]. The pre-learned NMF dictionary is also very fast to train, on the order of seconds or minutes, in contrast with hours required to train deep neural networks.…”

Section: Introduction. Citation type: mentioning

confidence: 99%

Unsupervised Low Latency Speech Enhancement With RT-GCC-NMF

Wood, Rouat 2019

IEEE J. Sel. Top. Signal Process.


In this paper, we present RT-GCC-NMF: a realtime (RT), two-channel blind speech enhancement algorithm that combines the non-negative matrix factorization (NMF) dictionary learning algorithm with the generalized cross-correlation (GCC) spatial localization method. Using a pre-learned universal NMF dictionary, RT-GCC-NMF operates in a frame-by-frame fashion by associating individual dictionary atoms to target speech or background interference based on their estimated time-delay of arrivals (TDOA). We evaluate RT-GCC-NMF on two-channel mixtures of speech and real-world noise from the Signal Separation and Evaluation Campaign (SiSEC). We demonstrate that this approach generalizes to new speakers, acoustic environments, and recording setups from very little training data, and outperforms all but one of the algorithms from the SiSEC challenge in terms of overall Perceptual Evaluation methods for Audio Source Separation (PEASS) scores and compares favourably to the ideal binary mask baseline. Over a wide range of input SNRs, we show that this approach simultaneously improves the PEASS and signal to noise ratio (SNR)-based Blind Source Separation (BSS) Eval objective quality metrics as well as the short-time objective intelligibility (STOI) and extended STOI (ESTOI) objective speech intelligibility metrics. A flexible, soft masking function in the space of NMF activation coefficients offers real-time control of the trade-off between interference suppression and target speaker fidelity. Finally, we use an asymmetric short-time Fourier transform (STFT) to reduce the inherent algorithmic latency of RT-GCC-NMF from 64 ms to 2 ms with no loss in performance. We demonstrate that latencies within the tolerable range for hearing aids are possible on current hardware platforms.
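The time-delay-of-arrival estimate that the abstract relies on can be sketched with a generalized cross-correlation between two channels. The sketch below uses the common PHAT weighting, which is an assumption (the abstract only says GCC), and it covers the delay estimate alone, not the NMF dictionary or the atom-to-source association. The function name `gcc_phat` and its parameters are illustrative.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the delay of x relative to y, in seconds, using GCC-PHAT.

    A positive result means x lags y. PHAT weighting is an assumption here.
    """
    n = len(x) + len(y)                      # zero-pad to avoid wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)                   # cross-power spectrum
    cross /= np.abs(cross) + 1e-12           # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)            # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so that negative lags come first, then lags 0..max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Illustrative check: delay a noise signal by 5 samples and recover the lag.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
delayed = np.concatenate((np.zeros(5), ref[:-5]))
print(gcc_phat(delayed, ref, fs) * fs)       # estimated lag in samples
```

In a frame-by-frame system such as the one described, an estimate like this would be computed per frame (and, for dictionary atoms, per frequency-weighted component) rather than once per signal.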

