ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8682783

Differentiable Consistency Constraints for Improved Deep Speech Enhancement

Abstract: In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's outpu…
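The STFT-consistency idea the abstract refers to can be illustrated numerically: a complex spectrogram is consistent when re-analyzing its inverse transform reproduces it, and an arbitrary (e.g. masked) estimate can be projected onto the consistent set by one iSTFT/STFT round trip. The sketch below uses SciPy's `stft`/`istft` and is only an illustration of the concept, not the paper's implementation; the window and hop choices are assumptions.

```python
# Sketch of an STFT consistency projection (illustrative, not the paper's code).
# A complex spectrogram X is "consistent" iff X == STFT(iSTFT(X)).
import numpy as np
from scipy.signal import stft, istft

def consistency_projection(X, nperseg=256):
    """Project a complex spectrogram onto the set of consistent STFTs."""
    _, x = istft(X, nperseg=nperseg)        # back to the time domain
    _, _, X_proj = stft(x, nperseg=nperseg)  # re-analyze
    return X_proj

rng = np.random.default_rng(0)

# The STFT of a real signal is already consistent: projection is (near) identity.
sig = rng.standard_normal(4096)
_, _, X = stft(sig, nperseg=256)
assert np.allclose(consistency_projection(X), X, atol=1e-8)

# A random complex array is generally NOT consistent: projection changes it.
Z = rng.standard_normal(X.shape) + 1j * rng.standard_normal(X.shape)
print(np.abs(consistency_projection(Z) - Z).max())  # clearly nonzero
```

Applying this projection inside the training graph (rather than only at inference) is what makes the constraint differentiable: the iSTFT and STFT are both linear operations, so gradients flow through the round trip.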

Cited by 98 publications (78 citation statements)
References 17 publications (27 reference statements)
“…3) The motivation is that using ℒ_RI alone produces worse magnitude estimates, as the estimated magnitudes need to compensate for the estimation error of phase. A major difference from [31], [33] is that we do not perform power or logarithmic compression on the magnitude spectra. This way, the DNN is always trained to estimate an STFT spectrogram that has consistent phase and magnitude structure, and hence would likely produce a good consistent STFT spectrogram at run time [34], [35].…”
Section: SISO1-BF-SISO2 System
confidence: 99%
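The uncompressed complex-plus-magnitude training objective described in the excerpt above can be sketched with a small example. The formulation below (L1 on real and imaginary parts plus L1 on raw, uncompressed magnitudes) is an assumed, typical shape for such a loss, not the cited paper's exact definition; the name `ri_mag_loss` is hypothetical.

```python
# Hedged sketch of an uncompressed complex-spectrogram loss: an L1 term on the
# real/imaginary (RI) components plus an L1 term on the raw magnitudes.
# This is an assumed formulation for illustration, not the cited paper's code.
import numpy as np

def ri_mag_loss(X_est, X_ref):
    """L1 loss on real/imag parts plus L1 loss on (uncompressed) magnitudes."""
    l_ri = np.mean(np.abs(X_est.real - X_ref.real)
                   + np.abs(X_est.imag - X_ref.imag))
    l_mag = np.mean(np.abs(np.abs(X_est) - np.abs(X_ref)))
    return l_ri + l_mag

rng = np.random.default_rng(2)
X = rng.standard_normal((129, 32)) + 1j * rng.standard_normal((129, 32))
print(ri_mag_loss(X, X))  # → 0.0 (perfect estimate)
```

The point of the extra magnitude term is the one the excerpt makes: with an RI-only loss, magnitude errors can hide inside phase compensation, whereas the explicit magnitude term penalizes them directly.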
“…In the iterative version (iTDCN++), the process of estimating the source signals is repeated twice. The mixture audio m and the output estimates ŝ(1) of the first TDCN++ module serve as input for the second TDCN++ module, which produces the final separation estimates ŝ(2). Both separation networks are trained using the permutation-invariant [5] negative signal-to-noise ratio (SNR), given by…”
Section: Time-domain Source Separation
confidence: 99%
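The permutation-invariant negative SNR mentioned in the excerpt can be sketched as follows. This is a minimal assumed formulation: the function names and the brute-force search over source permutations are illustrative, not the authors' code (practical systems with many sources typically use a more efficient assignment).

```python
# Sketch of permutation-invariant training (PIT) with a negative-SNR loss:
# evaluate the loss under every pairing of estimates to references and keep
# the best (lowest) one. Illustrative only, not the cited paper's code.
import itertools
import numpy as np

def neg_snr(est, ref, eps=1e-8):
    """Negative signal-to-noise ratio in dB for one source."""
    noise = est - ref
    return -10.0 * np.log10(np.sum(ref**2) / (np.sum(noise**2) + eps) + eps)

def pit_neg_snr(est, ref):
    """Minimum mean negative SNR over all permutations of the estimates."""
    n = len(ref)
    return min(
        np.mean([neg_snr(est[p[i]], ref[i]) for i in range(n)])
        for p in itertools.permutations(range(n))
    )

rng = np.random.default_rng(1)
s1, s2 = rng.standard_normal(1000), rng.standard_normal(1000)
refs = np.stack([s1, s2])
# Estimates arrive in swapped order; PIT finds the matching permutation,
# so the loss is strongly negative (high SNR) despite the label swap.
ests = np.stack([s2 + 0.01 * rng.standard_normal(1000), s1])
print(pit_neg_snr(ests, refs))
```

Taking the minimum over permutations is what makes the loss invariant to the arbitrary output ordering of the separation network.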
“…This extends the aforementioned end-to-end approach in order to predict estimates of the clean sources ŝ(1). These estimates, as well as the mixture m, are fed to a second sound classifier to extract embeddings V(2)_all for the source estimates and the mixture. These embeddings condition a second source separation subnetwork TDCN++(2), which produces the final source estimates ŝ(2).…”
Section: Fine-tuned Embeddings For the Iterative Model
confidence: 99%
“…However, its result is often inconsistent [15][16][17], and thus the estimated amplitude and phase may change when applying the inverse STFT (iSTFT) and STFT. Although several DNN-based monaural speech enhancement and separation methods improve performance by considering consistency [18,19], consistency has not been taken into account in DNN-based multi-channel speech enhancement.…”
Section: Introduction
confidence: 99%