2018
DOI: 10.1109/taslp.2018.2795749
Speaker-Independent Speech Separation With Deep Attractor Network

Abstract: Despite the recent success of deep learning for many speech processing tasks, single-microphone, speaker-independent speech separation remains challenging for two main reasons. The first reason is the arbitrary order of the target and masker speakers in the mixture (permutation problem), and the second is the unknown number of speakers in the mixture (output dimension problem). We propose a novel deep learning framework for speech separation that addresses both of these issues. We use a neural network to proje…

Cited by 209 publications (144 citation statements)
References 44 publications (74 reference statements)
“…We also compare the results with those of prior works in Table 2, all data on the test set for comparison, including scale-invariant signal-to-noise ratio improvement (SI-SNR i ) [6] and SDR i in dB. Row (a) is for the baseline separation model TasNet-v2 [4] we used throughout this work.…”
Section: Summary of the Results
confidence: 99%
“…Most of such approaches first transform the time-domain mixture waveform into some feature map, such as the spectrogram or 2-D feature map encoded by 1-D convolution blocks. An often used approach is then to infer a mask for each individual speaker [5,6,7,8,9,10], and multiply the masks element-wise with the mixture feature map to obtain the individual feature maps. A recent work integrating different mixture representations and performing cross-domain joint clustering for mask-inference has also shown encouraging improvements [11].…”
Section: Introduction
confidence: 99%
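The mask-inference approach described in this statement can be sketched in a few lines: each speaker gets a mask over the mixture feature map, and the masks are multiplied element-wise with the mixture to recover per-speaker feature maps (a minimal NumPy sketch; shapes and names are our assumptions):

```python
import numpy as np

def apply_masks(mixture_feat, masks):
    """Element-wise mask-based separation.

    mixture_feat: (T, F) mixture feature map (e.g. a magnitude spectrogram
                  or the output of a learned 1-D convolutional encoder)
    masks:        (C, T, F) one mask per speaker; if the masks sum to 1 over
                  the speaker axis, the separated maps sum back to the mixture
    returns:      (C, T, F) one feature map per speaker
    """
    return masks * mixture_feat[None, :, :]
```

In a full system the masks come from a neural network; here any non-negative arrays of the right shape will do for illustration.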
“…Specifically for separation tasks, speaker-discriminative embeddings are produced for targeted voice separation in [6] and for diarization in [17] yielding a significant improvement over the unconditional separation framework. Recent works [18,19] have utilized conditional embeddings for each music class in order to boost the performance of a deep attractor-network [20] for music separation.…”
Section: Introduction
confidence: 99%
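The deep attractor network [20] referenced here forms one "attractor" per source as a weighted mean of time-frequency embeddings, then derives masks from each bin's similarity to the attractors. A simplified inference-time sketch under our own naming and shape assumptions (the real model learns the embeddings with a recurrent network):

```python
import numpy as np

def danet_masks(embeddings, assignments, eps=1e-8):
    """Deep-attractor-style mask estimation (simplified sketch).

    embeddings:  (TF, K) embedding vector for each time-frequency bin
    assignments: (TF, C) soft or binary source assignment per bin
                 (at training time this is the oracle binary mask)
    returns:     (TF, C) masks that sum to 1 over the C sources
    """
    # Attractor per source: assignment-weighted mean of the bin embeddings
    attractors = assignments.T @ embeddings / (assignments.sum(0)[:, None] + eps)  # (C, K)
    # Similarity of each bin to each attractor, turned into a softmax mask
    logits = embeddings @ attractors.T                      # (TF, C)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)
```

Bins whose embeddings sit close to a source's attractor receive a mask value near 1 for that source.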
“…As supervised speech source separation techniques, recently, deep neural network (DNN) based approaches with a training dataset in which there are microphone input signals and corresponding oracle clean data have been widely studied, e.g., deep clustering (DC) [10,11], permutation invariant training (PIT) [12,13], deep attractor network [14,15], and hybrid approaches with BSS [16][17][18]. (This work was done while Yoshiki Masuyama and Yu Nakagome were interns at LINE Corporation.) DNN based approaches can capture complicated spectral characteristics of a speech source.…”
Section: Introduction
confidence: 99%
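Permutation invariant training (PIT) [12,13], mentioned in the statement above, resolves the permutation problem by evaluating the training loss under every speaker pairing and keeping the best one. A minimal sketch with mean-squared error as the per-pair loss (the choice of loss and the function name are our assumptions):

```python
from itertools import permutations
import numpy as np

def pit_mse(estimates, targets):
    """Permutation-invariant MSE over C estimated/target sources.

    estimates, targets: lists of C equal-length 1-D arrays.
    Tries every assignment of estimates to targets and returns the
    minimum mean loss, so output ordering does not matter.
    """
    C = len(estimates)
    best = float("inf")
    for perm in permutations(range(C)):
        loss = np.mean([np.mean((estimates[i] - targets[p]) ** 2)
                        for i, p in enumerate(perm)])
        best = min(best, loss)
    return best
```

Because the minimum runs over all C! permutations, this exhaustive form is practical only for small speaker counts; more efficient assignment schemes exist for larger C.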