Interspeech 2019
DOI: 10.21437/interspeech.2019-1728

Analysis of Deep Clustering as Preprocessing for Automatic Speech Recognition of Sparsely Overlapping Speech

Abstract: Significant performance degradation of automatic speech recognition (ASR) systems is observed when the audio signal contains cross-talk. One of the recently proposed approaches to solve the problem of multi-speaker ASR is the deep clustering (DPCL) approach. Combining DPCL with a state-of-the-art hybrid acoustic model, we obtain a word error rate (WER) of 16.5 % on the commonly used wsj0-2mix dataset, which is the best performance reported thus far to the best of our knowledge. The wsj0-2mix dataset contains s…

Citations: cited by 25 publications (16 citation statements)
References: 18 publications (46 reference statements)
“…In the single-channel multi-speaker speech separation and recognition tasks, several techniques have been proposed, achieving significant progress. One such technique is deep clustering (DPCL) [2][3][4]. In DPCL, a neural network is trained to map each time-frequency unit to an embedding vector, which is used to assign each unit to a source by a clustering algorithm afterwards.…”
Section: Introduction; citation type: mentioning
confidence: 99%
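
The clustering step described in the statement above can be illustrated with a minimal sketch. It is not the cited paper's implementation: the embeddings here are synthetic stand-ins for the output of a trained network, k-means is assumed as the clustering algorithm, and the names `kmeans` and `masks_from_embeddings` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means; returns one cluster label per row of `points`."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct rows (fancy indexing copies them).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return labels


def masks_from_embeddings(embeddings, n_sources=2):
    """Cluster (T, F, D) embeddings into (n_sources, T, F) binary masks."""
    n_frames, n_bins, dim = embeddings.shape
    labels = kmeans(embeddings.reshape(n_frames * n_bins, dim), n_sources)
    masks = [(labels == s).reshape(n_frames, n_bins) for s in range(n_sources)]
    return np.stack(masks).astype(float)


# Toy data (assumption): a checkerboard pattern decides which source
# dominates each time-frequency unit, and each unit's embedding is a noisy
# copy of its source's centre vector.
n_frames, n_bins = 6, 8
truth = (np.arange(n_frames)[:, None] + np.arange(n_bins)[None, :]) % 2
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
noise = 0.01 * np.random.default_rng(1).standard_normal((n_frames, n_bins, 3))
embeddings = centers[truth] + noise

masks = masks_from_embeddings(embeddings, n_sources=2)
# Each time-frequency unit is assigned to exactly one source.
print(masks.shape, bool(np.all(masks.sum(axis=0) == 1)))
```

In a separation front-end, each mask would be multiplied element-wise with the mixture spectrogram to estimate one speaker's signal. The order of the masks is arbitrary, since clustering carries no notion of which speaker is which.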
“…In [72], the authors combined approaches to address the cross-talk problem called deep clustering (DPCL) by creating a hybrid acoustic model. They obtained a WER of 16.5% on the wsj0-2mix dataset, which is the best performance reported so far.…”
Section: Speech Overlapping (Simultaneous Conversation); citation type: mentioning
confidence: 99%
“…Other works already studied the effectiveness of frequency domain source separation techniques as a front-end for ASR. DPCL and PIT have been efficiently used for this purpose, and it was shown that joint retraining for fine-tuning can improve performance [7,8,10]. E2E systems for single-channel multi-speaker ASR have been proposed that no longer consist of individual parts dedicated for source separation and speech recognition, but combine these functionalities into one large monolithic neural network.…”
Section: Relation To Prior Work; citation type: mentioning
confidence: 99%
“…Based on these source separation techniques, multi-speaker ASR systems have been constructed. DPCL and PIT have been used as frequency domain source separation front-ends for a state-of-the-art single-speaker ASR system and extended to jointly trained E2E or hybrid systems [7,8,9,10]. They showed that joint (re-)training can improve the performance of these models over a simple cascade system.…”
Section: Introduction; citation type: mentioning
confidence: 99%