Interspeech 2022
DOI: 10.21437/interspeech.2022-517
CMGAN: Conformer-based Metric GAN for Speech Enhancement

Cited by 39 publications (20 citation statements); References 0 publications.
“…The outputs of the two decoders are weighted and summed. … time-domain approaches that operate on the raw waveform of speech signals and time-frequency (TF) domain approaches [10–21] that manipulate the speech spectrogram have been proposed. Although the time-domain approaches have had some success, the TF domain approach has dominated the research trend.…”
Section: Introduction
confidence: 99%
“…Typically, most recent studies treat the real and imaginary parts as two separate real-valued sequences and model them with real-valued networks [10–17]. However, the speech spectrogram and the complex targets are naturally complex-valued; richer representations and more efficient modelling could potentially be achieved with complex networks [18,19].…”
Section: Introduction
confidence: 99%
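The distinction drawn in the excerpt above (stacking real and imaginary parts as two real channels versus keeping native complex arithmetic) can be sketched in a few lines of NumPy. The shapes and the weight matrix below are illustrative, not taken from any cited model:

```python
import numpy as np

# A complex spectrogram (time frames x frequency bins), e.g. from an STFT.
spec = np.random.randn(100, 257) + 1j * np.random.randn(100, 257)

# Real-valued modelling: stack real and imaginary parts as two channels
# before feeding a real-valued network, as in [10-17].
two_channel = np.stack([spec.real, spec.imag], axis=0)  # shape (2, 100, 257)

# A complex-valued layer instead keeps the native dtype; one complex
# matrix multiply naturally couples the real and imaginary parts:
W = np.random.randn(257, 257) + 1j * np.random.randn(257, 257)
out_complex = spec @ W  # still complex, shape (100, 257)

# Expressed on the two real channels, the same operation needs the
# (x_r + j x_i)(W_r + j W_i) expansion, i.e. four real matrix multiplies:
out_real = spec.real @ W.real - spec.imag @ W.imag
out_imag = spec.real @ W.imag + spec.imag @ W.real
```

The four-real-multiply expansion and the single complex multiply are algebraically identical; the argument in the excerpt is that complex networks exploit this coupling structurally rather than leaving it for a real-valued network to learn.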
“…This limitation is usually reflected as artifacts in the reconstructed speech.…” (The authors are with the Institute of Signal Processing and System Theory, University of Stuttgart, Germany; e-mail: sherif.abdulatif@iss.uni-stuttgart.de, ruizhe.cao96@gmail.com, bin.yang@iss.uni-stuttgart.de. A shorter version is available at https://arxiv.org/abs/2203.15149 [1].)
Section: Introduction
confidence: 99%
“…In recent years, supervised methods based on deep learning have been widely and successfully used for noise reduction in non-stationary noise environments, with the mainstream methods falling into two categories: time-frequency domain (T-F domain) methods and time-domain methods. T-F domain methods [7–9]: these methods usually apply a short-time Fourier transform (STFT) to the noisy signal to obtain its magnitude and phase, and then obtain the enhanced magnitude by estimating a weighted mask. Finally, the enhanced magnitude and the original phase are combined and the signal is reconstructed by an inverse short-time Fourier transform (iSTFT).…”
Section: Introduction
confidence: 99%
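The masking pipeline described in the excerpts (STFT, magnitude mask, reuse of the noisy phase, iSTFT) can be sketched in plain NumPy. The window and hop sizes are arbitrary choices, and the toy sigmoid-like mask below merely stands in for the mask a trained network would estimate:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Frame the signal, window each frame, and take its real FFT."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # complex, (frames, n_fft//2 + 1)

def istft(spec, n_fft=512, hop=128):
    """Windowed overlap-add inverse, normalized by the window-power sum."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=-1)
    out = np.zeros((spec.shape[0] - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + n_fft] += f * win
        norm[i * hop : i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

# Masking-based enhancement: mask the magnitude, reuse the noisy phase.
noisy = np.random.randn(16000)               # stand-in for a noisy utterance
spec = stft(noisy)
mag, phase = np.abs(spec), np.angle(spec)
mask = np.clip(mag / (mag + 1.0), 0.0, 1.0)  # stand-in for a learned mask
enhanced = istft(mask * mag * np.exp(1j * phase))
```

With an all-ones mask this round-trips the interior of the signal exactly, which is the property the masking approach relies on: all modelling effort goes into the mask while the (noisy) phase passes through unchanged — the very limitation that motivates complex-spectrogram methods such as CMGAN.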