ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413471
Streaming Multi-Speaker ASR with RNN-T

Abstract: Recent research shows end-to-end ASR systems can recognize overlapped speech from multiple speakers. However, all published works have assumed no latency constraints during inference, which does not hold for most voice assistant interactions. This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T) that has been shown to provide high recognition accuracy at a low latency online recognition regime. We investigate two approaches to multi-speaker model training …

Cited by 22 publications (21 citation statements)
References 22 publications
“…Firstly, we observed that the t-SOT TT-18 with only 40 msec algorithmic latency already outperformed the results of all prior streaming multi-talker ASR models. Note that even though t-SOT TT-18 has almost the same number of parameters with SURT [26,32] or MS-RNN-T [27,34], t-SOT is more space and computationally efficient in the inference because SURT and MS-RNN-T run decoding twice, once for each of the two output branches. Secondly, we observed a significant WER reduction by increasing algorithmic latency and the model size.…”
Section: Results
confidence: 99%
“…For streaming multi-talker ASR, the t-SOT framework has various advantages over SURT [26] and MS-RNN-T [27]. Firstly, t-SOT requires only a single decoding process as with the conventional single-talker ASR while SURT and MS-RNN-T require to execute the decoder multiple times (i.e., one decoder run for each output branches).…”
Section: Comparison To Prior Work
confidence: 99%
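The citation above contrasts t-SOT's single decoding pass with the per-branch decoding of SURT and MS-RNN-T: t-SOT interleaves both speakers' words into one token stream, with a special channel-change token marking speaker switches, and the per-speaker transcripts are recovered afterward by deserialization. A minimal sketch of that deserialization step follows; the token name `<cc>` and the function are illustrative assumptions, not the actual implementation from the cited papers.

```python
# Hypothetical sketch of t-SOT-style deserialization: the recognizer emits a
# single serialized token stream, and a channel-change token ("<cc>" here)
# toggles between two virtual output channels.

CC = "<cc>"

def deserialize(tokens):
    """Split a serialized token stream into two virtual speaker channels."""
    channels = [[], []]
    current = 0
    for tok in tokens:
        if tok == CC:
            current = 1 - current  # speaker change: switch virtual channel
        else:
            channels[current].append(tok)
    return channels

# Two overlapping utterances, serialized by word end time:
stream = ["hello", CC, "good", CC, "world", CC, "morning"]
print(deserialize(stream))  # [['hello', 'world'], ['good', 'morning']]
```

Because recovery is a cheap post-processing pass over one hypothesis, the decoder itself runs only once, which is the space and compute advantage the citation describes.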
“…While promising results were shown for such joint systems, most of the previous studies were limited to either simulated data [15, 17, 25, 28-35] or small-scale real data [11, 36-38]. It is because of the scarcity of training data for real meeting recordings, which takes a lot of time to precisely transcribe.…”
Section: Introduction
confidence: 99%