Interspeech 2018
DOI: 10.21437/interspeech.2018-1629

End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction

Abstract: This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network. Previous approaches, rather than computing a loss on the reconstructed signal, used a surrogate loss based on the target STFT magnitudes. This ignores reconstruction error introduced by phase inconsistency. In our approach, the loss function is directly defi…
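The gap between a magnitude-based surrogate loss and a loss on the reconstructed signal can be illustrated with a small sketch. It is not the paper's network: a single DFT frame stands in for the STFT (so it shows phase error in one frame, not inconsistency between overlapping frames), and the estimate is assumed to pair the true target magnitude with the mixture phase. The function names, signals, and L1 losses are illustrative choices.

```python
import cmath
import math


def dft(x):
    """Naive DFT of a real sequence (one frame standing in for the STFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def magnitude_loss(est_spec, ref_spec):
    """Surrogate loss: L1 distance between magnitudes only."""
    return sum(abs(abs(e) - abs(r)) for e, r in zip(est_spec, ref_spec))


def waveform_loss(est_wave, ref_wave):
    """Loss on the reconstructed signal: L1 distance between samples."""
    return sum(abs(e - r) for e, r in zip(est_wave, ref_wave))


n = 16
# Target and interferer share a frequency bin, so the mixture phase
# differs from the target phase in that bin.
target = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
interferer = [0.8 * math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
mixture = [a + b for a, b in zip(target, interferer)]

target_spec = dft(target)
mixture_spec = dft(mixture)

# Assumed estimate: perfect target magnitude combined with mixture phase.
est_spec = [abs(s) * cmath.exp(1j * cmath.phase(m))
            for s, m in zip(target_spec, mixture_spec)]
est_wave = idft(est_spec)

mag_err = magnitude_loss(est_spec, target_spec)
wav_err = waveform_loss(est_wave, target)
print(f"magnitude loss: {mag_err:.2e}")
print(f"waveform loss:  {wav_err:.2f}")
```

The magnitude loss is zero up to rounding while the waveform loss is not, which is the kind of error a magnitude-only objective cannot see.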

Cited by 117 publications (118 citation statements)
References 38 publications (67 reference statements)
“…A major difference from [31], [33] is that we do not perform power or logarithmic compression on the magnitude spectra. This way, the DNN is always trained to estimate an STFT spectrogram that has consistent phase and magnitude structure, and hence would likely produce a good consistent STFT spectrogram at run time [34], [35].…”
Section: Siso1-bf-siso2 System (mentioning)
confidence: 99%
“…Motivated by the recent advance in deep learning, several DNN-based phase reconstruction methods have been presented [18][19][20][21][22][23]. However, phase reconstruction from a given amplitude spectrogram is not an easy task for DNNs due to the following two problems: the wrapping effect and sensitivity to a shift of a waveform.…”
Section: Phase Reconstruction Via Dnn (mentioning)
confidence: 99%
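The two problems named in this statement can be seen in a small sketch. It is not taken from the citing paper: it uses a single DFT bin of a pure cosine, and the frame length, bin index, and starting phase are arbitrary illustrative values. A one-sample circular shift leaves the magnitude unchanged but moves the phase, and because the phase is reported in (-π, π], a small shift can appear as a large jump.

```python
import cmath
import math


def dft_bin(x, k):
    """Single DFT coefficient of a real sequence at bin k."""
    n = len(x)
    return sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))


n, k = 32, 5          # illustrative frame length and bin index
start_phase = 3.0     # radians, just below pi

wave = [math.cos(2 * math.pi * k * t / n + start_phase) for t in range(n)]
shifted = wave[1:] + wave[:1]  # advance the waveform by one sample (circular)

a = dft_bin(wave, k)
b = dft_bin(shifted, k)

# Magnitude is unaffected by the shift; the phase advances by 2*pi*k/n.
unwrapped = start_phase + 2 * math.pi * k / n
print(f"|a| = {abs(a):.3f}, |b| = {abs(b):.3f}")
print(f"phase before shift: {cmath.phase(a):+.4f}")
print(f"phase after shift:  {cmath.phase(b):+.4f} (unwrapped: {unwrapped:+.4f})")
```

The unwrapped phase moves by less than one radian, but the wrapped value flips sign, so a regression target defined on the wrapped phase is discontinuous under small waveform shifts.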
“…Removing unwanted environmental background noise in speech signals is a common step in speech processing applications. Complex valued neural networks as well as phase estimation have been of great interest in speech enhancement lately, since the perceptual audio quality has been reported to be improved significantly [5,7,6,10].…”
Section: Related Work (mentioning)
confidence: 99%
“…While other work uses the whole signal in an off-line processing fashion as input for the noise reduction [9,10,11], our work requires real-time capabilities. Both high-res spectrograms and off-line processing are not feasible for hearing aid applications, where the overall latency is a very important property.…”
Section: Introduction (mentioning)
confidence: 99%