2020
DOI: 10.48550/arXiv.2002.08933
Preprint

Wavesplit: End-to-End Speech Separation by Speaker Clustering

Abstract: We introduce Wavesplit, an end-to-end speech separation system. From a single recording of mixed speech, the model infers and clusters representations of each speaker and then estimates each source signal conditioned on the inferred representations. The model is trained on the raw waveform to jointly perform the two tasks. Our model infers a set of speaker representations through clustering, which addresses the fundamental permutation problem of speech separation. Moreover, the sequence-wide speaker representa…
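The permutation problem mentioned in the abstract is that a separation model's outputs have no fixed speaker order, so a loss must be matched to references under the best assignment. A minimal illustrative sketch (not the authors' code; Wavesplit avoids explicit permutation search by clustering sequence-wide speaker representations) of permutation-invariant scoring with SI-SNR:

```python
# Illustrative sketch of permutation-invariant SI-SNR scoring.
# This is NOT Wavesplit's training objective; it only shows the
# permutation problem that speaker clustering is designed to avoid.
import itertools
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (dB) between one estimate and one reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the target component.
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps)
                         / (np.dot(noise, noise) + eps))

def pit_si_snr(estimates, references):
    """Best mean SI-SNR over all assignments of estimates to references."""
    n = len(references)
    best = -np.inf
    for perm in itertools.permutations(range(n)):
        score = np.mean([si_snr(estimates[p], references[i])
                         for i, p in enumerate(perm)])
        best = max(best, score)
    return best
```

Because the score is maximized over all assignments, swapping the order of the estimated sources does not change the result.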

Cited by 47 publications (90 citation statements)
References 36 publications
“…Table II presents the experimental results of different models for speech separation on the WSJ0-2mix dataset [39] in terms of both ∆SI-SNR and ∆SDR. For the results without data augmentation, our DPRNN-SRSSN outperforms all other methods except Wavesplit [20] on both metrics, while DPTNET-SRSSN performs better than all other methods, which demonstrates the advantages of our model. The methods that learn a separable encoding space in a latent domain generally perform better than the methods that separate speech explicitly in the frequency domain, which implies that the frequency domain is not necessarily the best separation space for speech, as described in [10].…”
Section: Ablation Study: We perform ablation experiments on nine va...
confidence: 76%
“…2) Comparison with State-of-the-art Methods on WSJ0-2mix (involving 2 speakers): Next, we conduct experiments to compare our model with state-of-the-art methods for speech separation on the WSJ0-2mix dataset [2]. In particular, we compare our model with two types of methods: 1) methods performing separation in the frequency domain, including DPCL++ [3], UPIT-Bi-LSTM-ST [5], Chimera++ [7] and Deep CASA [8]; 2) methods performing separation in a learnable latent domain in an end-to-end way, including Bi-LSTM-TASNET [12], Conv-TASNET [10], E2EPF [28], FurcaNeXt [13], DPRNN-TASNET [15], SuDoRM-RF [17], Nachmani et al. [16], DPTNET-TASNET [18], SepFormer [19] and Wavesplit [20]. We evaluate the performance of two versions of our SRSSN: DPRNN-SRSSN and DPTNET-SRSSN.…”
Section: Ablation Study: We perform ablation experiments on nine va...
confidence: 99%
“…When using speaker embedding, the target speaker of interest is assumed to be known a priori. Techniques developed for speaker separation have also been applied to remove non-speech noise [32,33], with modifications to the training data. AEC has also been studied in isolation [34], or together with background noise [35].…”
Section: Introduction
confidence: 99%