2022
DOI: 10.48550/arxiv.2203.17004
Preprint

Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain

Abstract: Score-based generative models (SGMs) have recently shown impressive results for difficult generative tasks such as the unconditional and conditional generation of natural images and audio signals. In this work, we extend these models to the complex short-time Fourier transform (STFT) domain, proposing a novel training task for speech enhancement using a complex-valued deep neural network. We derive this training task within the formalism of stochastic differential equations, thereby enabling the use of predicto…

Cited by 3 publications (10 citation statements)
References 18 publications
“…Experimental results show that SGMSE [124] maintains more natural structures with fewer artifacts than prior work, and achieves quantitative improvements, e.g., an SI-SAR improvement of 3dB over CDiffuSE [63].…”
Section: Audio Restoration In the Time-frequency Domain (mentioning)
confidence: 92%
“…Although formulated on diffusion models, CDiffuSE [63] is trained to estimate the difference between clean and noisy speech. Therefore, it is pointed out in [124] that CDiffuSE [63] can be considered a discriminative task. To make the method pure generative and also avoid any noise prior, SGMSE [124] proposes a method based on stochastic differential equations (SDE) [103,105].…”
Section: Audio Restoration In the Time-frequency Domain (mentioning)
confidence: 99%
“…Denoising diffusion probabilistic models (diffusion models for short) have achieved the state-of-the-art (SOTA) generation results in various tasks, including image [34,22,8,7,33,39,44] and super resolution image generation [13,31,41,25], text-to-image generation [23,11,14,28], text-to-speech synthesis [4,15,27,17,16,5] and speech enhancement [20,21,42]. Especially, in audio synthesis, diffusion models have shown strong ability in modelling both spectrogram features [27,17] and raw waveforms [4,15,5].…”
Section: Denoising Diffusion Probabilistic Models (mentioning)
confidence: 99%