2021
DOI: 10.48550/arxiv.2112.10200
Preprint
Multi-turn RNN-T for streaming recognition of multi-party speech

Cited by 1 publication (3 citation statements). References 0 publications.
“…Firstly, we observed that the t-SOT TT-18 with only 40 msec algorithmic latency already outperformed the results of all prior streaming multi-talker ASR models. Note that even though t-SOT TT-18 has almost the same number of parameters with SURT [26,32] or MS-RNN-T [27,34], t-SOT is more space and computationally efficient in the inference because SURT and MS-RNN-T run decoding twice, once for each of the two output branches. Secondly, we observed a significant WER reduction by increasing algorithmic latency and the model size.…”
Section: Results
confidence: 99%
“…1 To increase the variability of the training data, we applied the speed perturbation [37] with the ratios of {0.9, 1.0, 1.1}, the volume perturbation with the ratio between 0.125 to 2.0, and the adaptive SpecAugment [38]. Following [21,34], we simulated the training data on the fly to generate infinite variations of the training samples.…”
Section: Experimental Settings
confidence: 99%