2016
DOI: 10.1186/s13636-016-0085-x

Localization based stereo speech source separation using probabilistic time-frequency masking and deep neural networks

Abstract: Time-frequency (T-F) masking is an effective method for stereo speech source separation. However, reliable estimation of the T-F mask from sound mixtures is a challenging task, especially when room reverberations are present in the mixtures. In this paper, we propose a new stereo speech separation system where deep neural networks are used to generate a soft T-F mask for separation. More specifically, the deep neural network, which is composed of two sparse autoencoders and a softmax regression, is used to estim…

Cited by 44 publications (30 citation statements) · References 37 publications
“…(18)]. Repeated iterations of the E- and M-steps are performed to obtain final estimates of the parameters, and subsequently ν_iτ(ω, t) in the final E-step is computed using (19). Clearly, summing ν_iτ(ω, t) over all possible delays τ gives the probability of the ith source being dominant at the time-frequency point (ω, t).…”
Section: X(ω, t) = [L(ω, t), R(ω, t)]
confidence: 99%
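The marginalization described in the statement above — summing the EM posteriors over all candidate delays to obtain a per-source soft mask — can be sketched as follows. This is a minimal sketch, not the paper's implementation; `nu` is a hypothetical posterior array of shape `(num_sources, num_delays, num_freqs, num_frames)` standing in for ν_iτ(ω, t):

```python
import numpy as np

# Hypothetical EM posteriors nu[i, tau, omega, t]: probability that
# T-F point (omega, t) belongs to source i with interaural delay tau.
rng = np.random.default_rng(0)
nu = rng.random((2, 5, 129, 100))
nu /= nu.sum(axis=(0, 1), keepdims=True)   # normalize over sources and delays

# Summing over all delays tau gives the probability of source i being
# dominant at T-F point (omega, t) -- a soft separation mask.
masks = nu.sum(axis=1)                     # shape (num_sources, num_freqs, num_frames)

# The per-source masks sum to 1 at every T-F point.
assert np.allclose(masks.sum(axis=0), 1.0)
```

Applying `masks[i]` elementwise to the mixture spectrogram then yields an estimate of the i-th source's spectrogram.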
“…Note that the time-frequency mask given by (20) is a byproduct of the EM algorithm, computed using the output of the E-step (19), which gives the probability of a spectrogram point (ω, t) coming from source i and delay τ, conditional on the interaural cues α(ω, t) and ϕ(ω, t) (estimated using the spectrogram of the observed speech mixture), in addition to the parameter estimates from the final M-step of the EM algorithm. For each bootstrap replication, the E-step allows us to compute this probability, or T-F mask, conditional on the interaural cues α*(ω, t) and ϕ*(ω, t), estimated from the spectrogram of the bootstrap speech mixture.…”
Section: L(1) … L(n), R(1) … R(n)
confidence: 99%
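A single E-step of the kind the statement above describes — a posterior over (source, delay) pairs at each T-F point, conditional on an interaural cue — can be sketched as below. This is a simplified illustration under assumed model quantities (the means `mu`, variance `sigma2`, and uniform `prior` are all hypothetical), using only the phase cue ϕ(ω, t) with a wrapped-Gaussian-style likelihood:

```python
import numpy as np

# Hypothetical single E-step over (source, delay) components given an
# observed interaural phase cue phi(omega, t).
rng = np.random.default_rng(1)
num_src, num_delays, F, T = 2, 3, 64, 50
phi = rng.uniform(-np.pi, np.pi, (F, T))                       # observed IPD cue
mu = rng.uniform(-np.pi, np.pi, (num_src, num_delays, 1, 1))   # assumed component means
sigma2 = 0.5                                                   # assumed variance
prior = np.full((num_src, num_delays, 1, 1), 1.0 / (num_src * num_delays))

# Likelihood of the cue under each (source, delay) component,
# using the wrapped phase difference to respect circularity.
diff = np.angle(np.exp(1j * (phi - mu)))
lik = np.exp(-diff**2 / (2 * sigma2))

# E-step posterior nu[i, tau, omega, t]; the T-F mask falls out as a
# byproduct by marginalizing over the delays.
post = prior * lik
post /= post.sum(axis=(0, 1), keepdims=True)
mask = post.sum(axis=1)                    # shape (num_src, F, T)
assert np.allclose(mask.sum(axis=0), 1.0)
```

In the bootstrap setting quoted above, the same E-step would simply be re-run with cues estimated from each bootstrap mixture, holding the final M-step parameters fixed.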
“…More recently, Deep Neural Networks (DNNs) [5] have shown state-of-the-art performance in source separation [6]–[8].…”
Section: Introduction
confidence: 99%
“…The method proposed in [8], [9] extracts a target speech signal from a competing speech signal, using a DNN trained on binaural spatial cues: mixing vectors (MV), interaural level difference (ILD), and interaural phase difference (IPD). However, these spatial cues become less effective in speech-noise scenarios, where the target speech is often masked by the background noise in adverse conditions.…”
Section: Introduction
confidence: 99%
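The ILD and IPD cues mentioned in the statement above are standard per-T-F-point features computed from the left and right channel spectrograms. A minimal sketch, assuming `L` and `R` are complex STFTs of the two channels (random placeholders here):

```python
import numpy as np

# Hypothetical complex stereo spectrograms L(omega, t) and R(omega, t).
rng = np.random.default_rng(2)
shape = (129, 100)
L = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
R = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

eps = 1e-12  # guard against log(0) / division by zero
# Interaural level difference (ILD) in dB, per T-F point.
ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
# Interaural phase difference (IPD) in radians, wrapped to (-pi, pi].
ipd = np.angle(L * np.conj(R))
```

Features like these (optionally with the mixing vector) are stacked per T-F point as DNN input; as the quote notes, they degrade when diffuse background noise swamps the target's spatial signature.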