Analysis of Deep Clustering as Preprocessing for Automatic Speech Recognition of Sparsely Overlapping Speech

Menne, Tobias; Sklyar, Ilya; Schlüter, Ralf; Ney, Hermann

doi:10.21437/interspeech.2019-1728

Cited by 25 publications

(16 citation statements)

References 18 publications

(46 reference statements)


“…In the single-channel multi-speaker speech separation and recognition tasks, several techniques have been proposed, achieving significant progress. One such technique is deep clustering (DPCL) [2][3][4]. In DPCL, a neural network is trained to map each time-frequency unit to an embedding vector, which is used to assign each unit to a source by a clustering algorithm afterwards.…”

Section: Introduction (mentioning)

confidence: 99%

MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition

Chang, Zhang, et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)


Recently, the end-to-end approach has proven its efficacy in monaural multi-speaker speech recognition. However, high word error rates (WERs) still prevent these systems from being used in practical applications. On the other hand, the spatial information in multi-channel signals has proven helpful in far-field speech recognition tasks. In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition. MIMO-Speech is a fully neural end-to-end framework, which is optimized only via an ASR criterion. It is comprised of: 1) a monaural masking network, 2) a multi-source neural beamformer, and 3) a multi-output speech recognition model. With this processing, the input overlapped speech is directly mapped to text sequences. We further adopted a curriculum learning strategy, making the best use of the training set to improve the performance. The experiments on the spatialized wsj1-2mix corpus show that our model can achieve more than 60% WER reduction compared to the single-channel system with high quality enhanced signals (SI-SDR = 23.1 dB) obtained by the above separation function.
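The deep clustering (DPCL) step quoted in the citation statement above (embed each time-frequency unit, then cluster the embeddings to assign each unit to a source) can be illustrated with a minimal sketch. This is not the implementation of any of the cited papers: the function name `cluster_embeddings`, the plain k-means loop with a deterministic farthest-point initialization, and the synthetic embeddings are all assumptions made for illustration. In a real system the embeddings would come from a trained neural network.

```python
import numpy as np


def cluster_embeddings(embeddings, n_sources=2, n_iter=10):
    """Assign each time-frequency unit to a source by k-means on its embedding.

    embeddings: array of shape (T, F, D), one D-dimensional vector per unit.
    Returns binary masks of shape (n_sources, T, F).
    Hypothetical helper, not taken from the cited work.
    """
    T, F, D = embeddings.shape
    X = embeddings.reshape(-1, D).astype(float)

    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [X[0]]
    for _ in range(1, n_sources):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.stack(centers)

    for _ in range(n_iter):
        # Squared distance from every unit to every cluster center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for k in range(n_sources):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)

    # Final assignment with the updated centers.
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = dist.argmin(axis=1)
    masks = np.stack([labels == k for k in range(n_sources)])
    return masks.reshape(n_sources, T, F).astype(float)


# Synthetic embeddings: the lower four frequency bins point along one axis,
# the upper four along another, plus a little noise.
rng = np.random.default_rng(0)
emb = np.zeros((5, 8, 3))
emb[:, :4, 0] = 1.0
emb[:, 4:, 1] = 1.0
emb += 0.01 * rng.standard_normal(emb.shape)

masks = cluster_embeddings(emb)
```

Each resulting mask could then be multiplied element-wise with the mixture magnitude spectrogram to estimate one source, which is how such masks are typically applied before passing the separated signals to a recognizer.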



“…In [72], the authors combined approaches to address the cross-talk problem called deep clustering (DPCL) by creating a hybrid acoustic model. They obtained a WER of 16.5% on the wsj0-2mix dataset, which is the best performance reported so far.…”

Section: Speech Overlapping (Simultaneous Conversation) (mentioning)

confidence: 99%

Automatic Speech Recognition: Systematic Literature Review

et al. 2021


A huge amount of research has been done in the field of speech signal processing in recent years. In particular, there has been increasing interest in the automatic speech recognition (ASR) technology field. ASR began with simple systems that responded to a limited number of sounds and has evolved into sophisticated systems that respond fluently to natural language. This systematic review of automatic speech recognition is provided to help other researchers with the most significant topics published in the last six years. This research will also help in identifying recent major ASR challenges in real-world environments. In addition, it discusses current research gaps in ASR. This review covers articles available in five research databases that were completed according to the preferred reporting items for systematic reviews and meta-analyses (PRISMA) protocol. The search strategy yielded 45 articles related to the study's scope for the period 2015-2020. The results presented in this review shed light on research trends in the area of ASR and also suggest new research directions.


“…Other works already studied the effectiveness of frequency domain source separation techniques as a front-end for ASR. DPCL and PIT have been efficiently used for this purpose, and it was shown that joint retraining for fine-tuning can improve performance [7,8,10]. E2E systems for single-channel multi-speaker ASR have been proposed that no longer consist of individual parts dedicated for source separation and speech recognition, but combine these functionalities into one large monolithic neural network.…”

Section: Relation To Prior Work (mentioning)

confidence: 99%

“…Based on these source separation techniques, multi-speaker ASR systems have been constructed. DPCL and PIT have been used as frequency domain source separation front-ends for a state-of-the-art single-speaker ASR system and extended to jointly trained E2E or hybrid systems [7,8,9,10]. They showed that joint (re-)training can improve the performance of these models over a simple cascade system.…”

Section: Introduction (mentioning)

confidence: 99%

End-to-End Training of Time Domain Audio Separation and Recognition

Neumann, Kinoshita, Drude, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


The rising interest in single-channel multi-speaker speech separation sparked development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0 % on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far.

