2020
DOI: 10.48550/arXiv.2002.08933
Preprint

Wavesplit: End-to-End Speech Separation by Speaker Clustering

Abstract: We introduce Wavesplit, an end-to-end speech separation system. From a single recording of mixed speech, the model infers and clusters representations of each speaker and then estimates each source signal conditioned on the inferred representations. The model is trained on the raw waveform to jointly perform the two tasks. Our model infers a set of speaker representations through clustering, which addresses the fundamental permutation problem of speech separation. Moreover, the sequence-wide speaker representa…
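The permutation problem mentioned in the abstract is that a separation model's outputs have no fixed speaker order, so a loss must be matched to references under the best assignment. A minimal illustrative sketch (not the authors' code; Wavesplit avoids explicit permutation search by clustering sequence-wide speaker representations) of permutation-invariant scoring with SI-SNR:

```python
# Illustrative sketch of permutation-invariant SI-SNR scoring.
# This is NOT Wavesplit's training objective; it only shows the
# permutation problem that speaker clustering is designed to avoid.
import itertools
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR (dB) between one estimate and one reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the target component.
    target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps)
                         / (np.dot(noise, noise) + eps))

def pit_si_snr(estimates, references):
    """Best mean SI-SNR over all assignments of estimates to references."""
    n = len(references)
    best = -np.inf
    for perm in itertools.permutations(range(n)):
        score = np.mean([si_snr(estimates[p], references[i])
                         for i, p in enumerate(perm)])
        best = max(best, score)
    return best
```

Because the score is maximized over all assignments, swapping the order of the estimated sources does not change the result.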

Cited by 47 publications (90 citation statements)
References 36 publications
“…Table II presents the experimental results of different models for speech separation on the WSJ0-2mix dataset [39] in terms of both ∆SI-SNR and ∆SDR. For the results without data augmentation, our DPRNN-SRSSN outperforms all other methods except Wavesplit [20] on both metrics, while DPTNET-SRSSN performs better than all other methods, which demonstrates the advantages of our model. The methods that learn a separable encoding space in a latent domain generally perform better than the methods that separate speech explicitly in the frequency domain, which implies that the frequency domain is not necessarily the best separation space for speech, as described in [10].…”
Section: Ablation Study: We perform ablation experiments on nine va...
confidence: 76%
“…2) Comparison with State-of-the-art Methods on WSJ0-2mix (involving 2 speakers): Next, we conduct experiments to compare our model with state-of-the-art methods for speech separation on the WSJ0-2mix dataset [2]. In particular, we compare our model with two types of methods: 1) methods performing separation in the frequency domain, including DPCL++ [3], UPIT-Bi-LSTM-ST [5], Chimera++ [7] and Deep CASA [8]; 2) methods performing separation in a learnable latent domain in an end-to-end way, including Bi-LSTM-TASNET [12], Conv-TASNET [10], E2EPF [28], FurcaNeXt [13], DPRNN-TASNET [15], SuDoRM-RF [17], Nachmani et al. [16], DPTNET-TASNET [18], SepFormer [19] and Wavesplit [20]. We evaluate the performance of two versions of our SRSSN: DPRNN-SRSSN and DPTNET-SRSSN.…”
Section: Ablation Study: We perform ablation experiments on nine va...
confidence: 99%
“…When using speaker embedding, the target speaker of interest is assumed to be known a priori. Techniques developed for speaker separation have also been applied to remove non-speech noise [32,33], with modifications to the training data. AEC has also been studied in isolation [34], or together with background noise [35].…”
Section: Introduction
confidence: 99%