ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413471
Streaming Multi-Speaker ASR with RNN-T

Abstract: Recent research shows end-to-end ASR systems can recognize overlapped speech from multiple speakers. However, all published works have assumed no latency constraints during inference, which does not hold for most voice assistant interactions. This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T) that has been shown to provide high recognition accuracy at a low latency online recognition regime. We investigate two approaches to multi-speaker model training …

Cited by 22 publications (21 citation statements)
References 22 publications
“…Firstly, we observed that the t-SOT TT-18 with only 40 msec algorithmic latency already outperformed the results of all prior streaming multi-talker ASR models. Note that even though t-SOT TT-18 has almost the same number of parameters with SURT [26,32] or MS-RNN-T [27,34], t-SOT is more space and computationally efficient in the inference because SURT and MS-RNN-T run decoding twice, once for each of the two output branches. Secondly, we observed a significant WER reduction by increasing algorithmic latency and the model size.…”
Section: Results
confidence: 99%
“…For streaming multi-talker ASR, the t-SOT framework has various advantages over SURT [26] and MS-RNN-T [27]. Firstly, t-SOT requires only a single decoding process as with the conventional single-talker ASR while SURT and MS-RNN-T require to execute the decoder multiple times (i.e., one decoder run for each output branches).…”
Section: Comparison To Prior Work
confidence: 99%
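The citation above contrasts t-SOT's single decoding pass with the per-branch decoding of SURT and MS-RNN-T: t-SOT interleaves both speakers' words into one token stream, with a special channel-change token marking speaker switches, and the per-speaker transcripts are recovered afterward by deserialization. A minimal sketch of that deserialization step follows; the token name `<cc>` and the function are illustrative assumptions, not the actual implementation from the cited papers.

```python
# Hypothetical sketch of t-SOT-style deserialization: the recognizer emits a
# single serialized token stream, and a channel-change token ("<cc>" here)
# toggles between two virtual output channels.

CC = "<cc>"

def deserialize(tokens):
    """Split a serialized token stream into two virtual speaker channels."""
    channels = [[], []]
    current = 0
    for tok in tokens:
        if tok == CC:
            current = 1 - current  # speaker change: switch virtual channel
        else:
            channels[current].append(tok)
    return channels

# Two overlapping utterances, serialized by word end time:
stream = ["hello", CC, "good", CC, "world", CC, "morning"]
print(deserialize(stream))  # [['hello', 'world'], ['good', 'morning']]
```

Because recovery is a cheap post-processing pass over one hypothesis, the decoder itself runs only once, which is the space and compute advantage the citation describes.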
“…While promising results were shown for such joint systems, most of the previous studies were limited to either simulated data [15, 17, 25, 28-35] or small-scale real data [11, 36-38]. It is because of the scarcity of training data for real meeting recordings, which takes a lot of time to precisely transcribe.…”
Section: Introduction
confidence: 99%