2021 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt48900.2021.9383556

Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis

Cited by 51 publications (34 citation statements)
References 35 publications

“…2. Note that the system named "Joint System 2 (J2)" is a new pipeline that we propose in this paper, while the other three systems are known pipelines for SA-ASR that have been investigated in prior works (such as [13,32]).…”
Section: Modular and Joint Systems for Speaker-Attributed ASR
confidence: 99%
“…There have been many studies on microphone array recordings to improve speech separation [6][7][8], speaker diarization [6,9], and ASR systems [10,11] by using spatial information. On the other hand, SA-ASR based on a single microphone is still highly challenging, and only a limited number of studies have been conducted on fully automatic SA-ASR systems for monaural long-form audio [12][13][14].…”
Section: Introduction
confidence: 99%
“…This is what the automatic system is trying to learn. For many downstream tasks of speech processing, such as speaker diarization [2] and automatic speech recognition [3], speech separation is a necessary pre-processing step.…”
Section: Introduction
confidence: 99%
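
The separation-first structure described in the excerpt above can be written down as a small pipeline. The sketch below is a hypothetical illustration: the names separate_then_process, separator, diarizer, and recognizer are placeholders and do not come from the cited papers.

# A minimal sketch of separation as a pre-processing step: the mixture is
# separated first, and diarization/ASR then run on each estimated stream.
# All callables are hypothetical placeholders.
def separate_then_process(mixture, separator, diarizer, recognizer):
    streams = separator(mixture)  # e.g. one estimated waveform per speaker
    results = []
    for stream in streams:
        results.append({
            "segments": diarizer(stream),      # speaker activity on the separated stream
            "transcript": recognizer(stream),  # ASR on the separated stream
        })
    return results
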
“…Many variants of this approach have been investigated, such as methods using agglomerative hierarchical clustering (AHC) [2], spectral clustering (SC) [3], and variational Bayesian inference [4,5]. While these approaches showed good performance under difficult test conditions [6], they cannot handle overlapped speech [7]. Several extensions were also proposed to handle overlapping speech, such as using overlap detection [8] and speech separation [9].…”
Section: Introduction
confidence: 99%
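
To make the clustering-based approach in the excerpt above concrete, here is a minimal sketch of AHC over per-segment speaker embeddings using SciPy. The random embeddings, cosine distance, average linkage, and threshold value are illustrative assumptions, not the configuration used in the cited systems.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def ahc_diarize(embeddings: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Assign a speaker label to each segment embedding via AHC."""
    dists = pdist(embeddings, metric="cosine")       # pairwise cosine distances
    tree = linkage(dists, method="average")          # average-linkage clustering
    return fcluster(tree, t=threshold, criterion="distance")  # cut the dendrogram

# Illustrative usage with random vectors standing in for real speaker embeddings.
rng = np.random.default_rng(0)
segment_embeddings = rng.normal(size=(20, 128))
labels = ahc_diarize(segment_embeddings)
print(labels)  # one speaker label per segment

Note that each segment receives exactly one label here, which is why, as the excerpt points out, plain clustering cannot represent overlapped speech.
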
“…Target-speaker voice activity detection (TS-VAD) [15] is another approach, in which a neural network is trained to estimate the speech activities of all the speakers specified by a set of pre-estimated speaker embeddings. Of these speaker diarization methods, TS-VAD achieved state-of-the-art (SOTA) results in several diarization tasks [7,15], including recent international competitions [16,17]. On the other hand, TS-VAD has the limitation that the number of recognizable speakers is bounded by the number of output nodes of the model.…”
Section: Introduction
confidence: 99%
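
The TS-VAD idea summarized above, a network that takes mixture features together with pre-estimated speaker embeddings and outputs per-frame speech activity for a fixed set of speakers, can be sketched as follows. The BLSTM/linear structure, layer sizes, and class name are assumptions for illustration and are not the architecture of [15].

import torch
import torch.nn as nn

class TSVADSketch(nn.Module):
    def __init__(self, feat_dim=80, emb_dim=128, hidden=256, max_speakers=4):
        super().__init__()
        self.max_speakers = max_speakers
        # Encode the acoustic features of the mixture.
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Per-speaker detector conditioned on that speaker's embedding.
        self.detector = nn.Sequential(
            nn.Linear(2 * hidden + emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats, spk_embs):
        # feats: (batch, time, feat_dim); spk_embs: (batch, max_speakers, emb_dim)
        enc, _ = self.encoder(feats)  # (batch, time, 2 * hidden)
        activities = []
        for s in range(self.max_speakers):  # one output stream per enrolled speaker
            emb = spk_embs[:, s:s + 1, :].expand(-1, enc.size(1), -1)
            logits = self.detector(torch.cat([enc, emb], dim=-1))
            activities.append(torch.sigmoid(logits))  # per-frame speech activity
        return torch.cat(activities, dim=-1)  # (batch, time, max_speakers)

model = TSVADSketch()
activity = model(torch.randn(2, 100, 80), torch.randn(2, 4, 128))
print(activity.shape)  # torch.Size([2, 100, 4])

The fixed max_speakers output dimension makes the quoted limitation explicit: the model can only track as many speakers as it has output nodes.
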