ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683437

Low-latency Deep Clustering for Speech Separation

Abstract: This paper proposes a low algorithmic latency adaptation of the deep clustering approach to speaker-independent speech separation. It consists of three parts: a) the use of long short-term memory (LSTM) networks instead of the bidirectional variant used in the original work, b) using a short synthesis window (here 8 ms) required for low-latency operation, and c) using a buffer at the beginning of the audio mixture to estimate cluster centres corresponding to the constituent speakers, which are then utilized to sepa…

Cited by 12 publications (13 citation statements)
References 22 publications
“…In TF spectrum based speech separation with DNNs, the training targets for supervised learning are usually in the form of a TF representation, e.g., TF masks or affinity matrices (for deep clustering [5], [18]). The STFT is a popular choice where the choice of the window length is important.…”
Section: Proposed Methods
confidence: 99%
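The excerpt above singles out the STFT window length as a key design choice: the low-latency setting of this paper uses an 8 ms synthesis window, which limits frequency resolution in exchange for low algorithmic latency. A minimal sketch of mask-based separation in the STFT domain, using scipy and a random placeholder mask (in the actual systems the mask comes from the DNN):

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
win_len = int(0.008 * fs)          # 8 ms analysis/synthesis window -> 128 samples
hop = win_len // 2                 # 50% overlap

x = np.random.randn(fs)            # 1 s of noise standing in for a mixture

# TF representation: a mask is applied per time-frequency bin
f, t, X = stft(x, fs=fs, nperseg=win_len, noverlap=win_len - hop)
mask = np.random.rand(*X.shape)    # placeholder for a DNN-estimated mask in [0, 1]
_, x_hat = istft(X * mask, fs=fs, nperseg=win_len, noverlap=win_len - hop)
```

With this window there are only `win_len // 2 + 1 = 65` frequency bins per frame, illustrating the resolution/latency trade-off the excerpt refers to.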
“…In order to show the independence of our proposed approach from the type of models and datasets used, we evaluate it on two tasks: speaker-independent separation with an online DC model [18], and speaker-dependent separation with mask inference (MI) network that directly predicts masks. We evaluate the former and latter on two-speaker mixtures from Wall Street Journal (WSJ0) [19] and Danish HINT [20,21] databases, respectively.…”
Section: Introduction
confidence: 99%
“…Many techniques reported consisted of multiple stages separately optimized under different criteria, such as signal representation and embedding [15]. Some embedding clustering methods add phase information on multi-channel, and other research concerns the low-delay of deep clustering approaches [16], [17]. Orthogonal deep clustering improves the separation performance of the model by adding an orthogonal constraint penalty term of the objective function to reduce the correlation between the embedded expression [18].…”
Section: Related Work (A. Speech Separation Based on Deep Clustering)
confidence: 99%
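The orthogonal-constraint idea mentioned in the excerpt can be illustrated with a simple penalty term. This is only a sketch under an assumed formulation (the cited work may define the penalty differently): normalize the embedding columns and penalize the Frobenius distance of their Gram matrix from the identity, which drives the embedding dimensions toward decorrelation.

```python
import numpy as np

def orthogonality_penalty(V):
    """Penalty encouraging the D embedding dimensions of V (TF bins x D)
    to be decorrelated: ||G - I||_F^2, where G is the Gram matrix of the
    column-normalized embeddings. Illustrative form only."""
    Vn = V / (np.linalg.norm(V, axis=0, keepdims=True) + 1e-12)
    G = Vn.T @ Vn
    return float(np.sum((G - np.eye(G.shape[0])) ** 2))

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((100, 4)))           # orthonormal columns
correlated = np.tile(rng.standard_normal((100, 1)), (1, 4))  # identical columns
```

An orthonormal embedding incurs a near-zero penalty, while fully correlated embedding dimensions are penalized heavily; added to the deep clustering objective, such a term reduces redundancy between embedding dimensions.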
“…However, the MISI algorithm has been introduced in a heuristic fashion, therefore there is currently no proof that it converges. Besides, while several recent works addressed the problem of low-latency magnitude estimation [20], [21], [22], the MISI algorithm operates offline, as it computes the whole STFT and its inverse at each iteration. This makes it impracticable for real-time applications such as hearing aids.…”
Section: Introduction (Audio Source Separation)
confidence: 99%
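The offline nature of MISI noted in the excerpt follows directly from its update rule: every iteration redistributes the mixing-consistency error across the sources and re-applies a full-signal STFT and inverse STFT to each estimate. A hedged sketch of that iteration (function name, window parameters, and the fixed iteration count are illustrative choices, not the cited works' exact setup):

```python
import numpy as np
from scipy.signal import stft, istft

def misi(mixture, mags, n_iter=10, nperseg=512, noverlap=256):
    """Multiple Input Spectrogram Inversion (sketch). Each iteration
    recomputes the STFT and inverse STFT of every source estimate over
    the whole signal, which is why the algorithm is inherently offline."""
    kw = dict(nperseg=nperseg, noverlap=noverlap)
    _, _, Xm = stft(mixture, **kw)
    # initialize every source estimate with the mixture phase
    ests = [istft(m * np.exp(1j * np.angle(Xm)), **kw)[1][:len(mixture)]
            for m in mags]
    for _ in range(n_iter):
        err = mixture - sum(ests)                      # mixing-consistency error
        updated = []
        for m, s in zip(mags, ests):
            _, _, S = stft(s + err / len(ests), **kw)  # redistribute the error
            # keep the target magnitude, take the phase of the updated STFT
            updated.append(istft(m * np.exp(1j * np.angle(S)), **kw)[1][:len(mixture)])
        ests = updated
    return ests

# tiny usage: two sinusoids with oracle magnitude spectrograms
fs = 8000
n = np.arange(4096)
s1, s2 = np.sin(2 * np.pi * 440 * n / fs), np.sin(2 * np.pi * 220 * n / fs)
mix = s1 + s2
mags = [np.abs(stft(s, nperseg=512, noverlap=256)[2]) for s in (s1, s2)]
est1, est2 = misi(mix, mags, n_iter=5)
```

Because `stft`/`istft` operate on the entire signal in every pass, the latency is the full utterance length, in contrast to the frame-by-frame, short-window processing of the low-latency methods the excerpt contrasts it with.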