DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement

Hu, Yi; Liu, Yun; Lv, Shubo; Xing, Mengtao; Zhang, Shimin; Fu, Yihui; Wu, Jian; Zhang, Bihong; Xie, Lei

doi:10.48550/arxiv.2008.00264

Cited by 65 publications

(120 citation statements)

References 34 publications

(44 reference statements)

Mentioning: 119

“…The decoder predicts a complex ratio mask $M = \mathrm{Cat}(M_r, M_i) \in \mathbb{R}^{T \times 2F}$, where $M_r$ and $M_i$ represent the real and imaginary parts of mask. We use the mask applying scheme of DCCRN-E [3], which is called Mask Apply E in Fig. 1,…”

Section: Coarse Enhancement Module (mentioning)

confidence: 99%
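The mask-applying step mentioned in this statement can be sketched as follows. This is a minimal illustration, assuming the polar-form rule reported for DCCRN-E: the mask magnitude is squashed with tanh, the mask phase comes from atan2, and the two are applied to the noisy magnitude and phase respectively. The function names `split_mask` and `apply_mask_e`, the per-frame list layout, and the ordering of $M_r$ before $M_i$ inside the concatenated row are assumptions made for this example, not details taken from either paper's code.

```python
import cmath
import math

def split_mask(mask_row):
    """Split one frame of Cat(M_r, M_i), length 2F, into (M_r, M_i).

    Assumes the real parts occupy the first F entries (hypothetical layout).
    """
    half = len(mask_row) // 2
    return mask_row[:half], mask_row[half:]

def apply_mask_e(noisy_bins, mask_row):
    """Apply a complex ratio mask to one frame of noisy STFT bins in polar form."""
    m_r, m_i = split_mask(mask_row)
    enhanced = []
    for y, mr, mi in zip(noisy_bins, m_r, m_i):
        mask_mag = math.tanh(math.hypot(mr, mi))   # bounded to [0, 1)
        mask_phase = math.atan2(mi, mr)            # phase correction to add
        y_mag, y_phase = cmath.polar(y)
        enhanced.append(cmath.rect(y_mag * mask_mag, y_phase + mask_phase))
    return enhanced

# One frame with F = 2 bins; the mask row is [M_r..., M_i...].
noisy_frame = [complex(3.0, 4.0), complex(1.0, 0.0)]
mask_frame = [0.0, 2.0, 1.0, 0.0]   # M_r = [0, 2], M_i = [1, 0]
enhanced = apply_mask_e(noisy_frame, mask_frame)
```

In this toy frame the first bin is attenuated and rotated by a quarter turn, while the second bin is only attenuated because its mask has zero imaginary part.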


HGCN: Harmonic gated compensation network for speech enhancement

Wang, Zhu, Gao et al. 2022

Preprint

Mask processing in the time-frequency (T-F) domain through neural networks has been one of the mainstream approaches to single-channel speech enhancement. However, it is hard for most models to handle the situation when harmonics are partially masked by noise. To tackle this challenge, we propose a harmonic gated compensation network (HGCN). We design a high-resolution harmonic integral spectrum to improve the accuracy of harmonic location prediction. Then we add voice activity detection (VAD) and voiced region detection (VRD) to the convolutional recurrent network (CRN) to filter harmonic locations. Finally, the harmonic gating mechanism is used to guide the compensation model to adjust the coarse results from the CRN and obtain refined enhanced results. Our experiments show HGCN achieves substantial gain over a number of advanced approaches in the community.


“…T models process the waveform directly to obtain the target speech [1]. T-F models precess the spectrum after the short-time fast Fourier transform (STFT) [2][3][4]. Generally speaking, for speech enhancement, it's the T-F structure of speech that is enhanced.…”

Section: Introduction (mentioning)

confidence: 99%
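To make the T-F representation referred to in this statement concrete, here is a naive short-time Fourier transform written with the Python standard library only. It is a sketch for illustration: the function name `naive_stft`, the frame length of 16, the hop of 8, and the periodic Hann window are assumptions chosen for a small example, and a direct O(N^2) DFT stands in for the FFT a real system would use.

```python
import cmath
import math

def naive_stft(signal, frame_len=16, hop=8):
    """Return a list of frames, each a list of frame_len // 2 + 1 complex bins."""
    # Periodic Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Direct DFT, keeping only the non-negative frequencies.
        bins = [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
                for k in range(frame_len // 2 + 1)]
        frames.append(bins)
    return frames

# A pure tone that completes 4 cycles per 16-sample frame.
tone = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = naive_stft(tone)
# Index of the strongest bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: abs(spec[0][k]))
```

Each entry of `spec` is one time frame, and each complex bin carries both a magnitude and a phase, which is the quantity that mask-based T-F models operate on.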

HGCN: Harmonic gated compensation network for speech enhancement

Wang, Zhu, Gao et al. 2022

Preprint

“…For example, complex multiplication capture rotation in the complex domain and can easily manipulate the signal phase. Thus, complex neural networks are found to be more effective for applications such as wireless communication (Marseet and Sahin 2017) and noise suppression (Hu et al 2020; Bassey, Qian, and Li 2021). Compared to real-valued networks, complex representation also restricts the degree of freedom of the parameters by enforcing correlation between the real and imaginary parts, which enhances the generalization capacity of the model in other applications.…”

Section: Hybridbeam Architecture (mentioning)

confidence: 99%
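The two claims in this statement, that complex multiplication rotates the signal phase and that it ties the real and imaginary parts together, can be checked with a few lines of arithmetic. The sketch below is illustrative only; `complex_mul` is a hypothetical helper that expands the product of a weight a + jb and an input x_r + j x_i into real operations.

```python
import math

def complex_mul(a, b, x_r, x_i):
    """Multiply (a + jb) by (x_r + j x_i) using only real arithmetic.

    Equivalent to the 2x2 real matrix [[a, -b], [b, a]] acting on (x_r, x_i):
    two free parameters instead of the four of an unconstrained 2x2 matrix.
    """
    return a * x_r - b * x_i, b * x_r + a * x_i

# A unit-magnitude weight with angle theta acts as a pure rotation.
theta = math.pi / 2
a, b = math.cos(theta), math.sin(theta)
out_r, out_i = complex_mul(a, b, 1.0, 0.0)

# Cross-check against Python's built-in complex type.
reference = complex(a, b) * complex(1.0, 0.0)
```

Here the input 1 + 0j is rotated by a quarter turn onto the imaginary axis, so the phase changes while the magnitude is preserved.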

Hybrid Neural Networks for On-device Directional Hearing

Wang, Zhu, Zhang et al. 2021

Preprint

On-device directional hearing requires audio source separation from a given direction while achieving stringent human-imperceptible latency requirements. While neural nets can achieve significantly better performance than traditional beamformers, all existing models fall short of supporting low-latency causal inference on computationally-constrained wearables. We present Hybrid-Beam, a hybrid model that combines traditional beamformers with a custom lightweight neural net. The former reduces the computational burden of the latter and also improves its generalizability, while the latter is designed to further reduce the memory and computational overhead to enable real-time and low-latency operations. Our evaluation shows comparable performance to state-of-the-art causal inference models on synthetic data while achieving a 5x reduction of model size, 4x reduction of computation per second, 5x reduction in processing time and generalizing better to real hardware data. Further, our real-time hybrid model runs in 8 ms on mobile CPUs designed for low-power wearable devices and achieves an end-to-end latency of 17.5 ms.


“…Previous researches suggest that the complex ratio masks (CRMs) outperform both the binary masks (BMs) and real-value ratio masks (RMs) on speech separation [30,31] and enhancement [32] tasks. For this reason, the complex ideal ratio mask (cIRM) m t,f of the target speech is estimated in the separation module.…”

Section: Tf Masking Based Speech Separation (mentioning)

confidence: 99%
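A single T-F bin is enough to show why a complex ratio mask can do better than a real-valued one. The sketch below assumes the cIRM is the complex quotient of the clean and noisy bins, and it uses a simplified magnitude-only mask |S|/|Y| as the real-valued comparison, which is one of several ratio-mask definitions in use. The function names and the numeric values are hypothetical.

```python
import cmath

def complex_ideal_ratio_mask(clean, noisy):
    """cIRM for one bin: the complex number m such that m * noisy == clean."""
    return clean / noisy

def magnitude_ratio_mask(clean, noisy):
    """Simplified real-valued mask: scales magnitude, leaves noisy phase as is."""
    return abs(clean) / abs(noisy)

clean = complex(1.0, 2.0)
noise = complex(0.5, -1.0)
noisy = clean + noise

est_complex = complex_ideal_ratio_mask(clean, noisy) * noisy
est_real = magnitude_ratio_mask(clean, noisy) * noisy

# Remaining phase error of each estimate, in radians.
phase_err_complex = abs(cmath.phase(est_complex) - cmath.phase(clean))
phase_err_real = abs(cmath.phase(est_real) - cmath.phase(clean))
```

The complex mask recovers the clean bin up to floating-point error, whereas the magnitude-only mask matches the clean magnitude but keeps the noisy phase.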

Mixed Precision DNN Qunatization for Overlapped Speech Separation and Recognition

Xu, Yu, Liu et al. 2021

Preprint

Recognition of overlapped speech has been a highly challenging task to date. State-of-the-art multi-channel speech separation systems are becoming increasingly complex and expensive for practical applications. To this end, low-bit neural network quantization provides a powerful solution to dramatically reduce their model size. However, current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different model components to quantization errors. In this paper, novel mixed precision DNN quantization methods are proposed by applying locally variable bit-widths to individual TCN components of a TF masking based multi-channel speech separation system. The optimal local precision settings are automatically learned using three techniques. The first two approaches utilize quantization sensitivity metrics based on either the mean square error (MSE) loss function curvature, or the KL-divergence measured between full precision and quantized separation models. The third approach is based on mixed precision neural architecture search. Experiments conducted on the LRS3-TED corpus simulated overlapped speech data suggest that the proposed mixed precision quantization techniques consistently outperform the uniform precision baseline speech separation systems of comparable bit-widths in terms of SI-SNR and PESQ scores as well as word error rate (WER) reductions up to 2.88% absolute (8% relative).

