2020
DOI: 10.48550/arxiv.2011.11671
Preprint

Streaming Multi-speaker ASR with RNN-T

Abstract: Recent research shows end-to-end ASR systems can recognize overlapped speech from multiple speakers. However, all published works have assumed no latency constraints during inference, which does not hold for most voice assistant interactions. This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T) that has been shown to provide high recognition accuracy at a low latency online recognition regime. We investigate two approaches to multi-speaker model training …

Cited by 4 publications (2 citation statements)
References 19 publications
“…To overcome this suboptimality, there has been a series of studies for multi-talker ASR that directly transcribes multiple utterances from overlapped speech. One popular approach is using a neural network that has multiple output layers, each of which recognizes one speaker [16,17,18,19,20,21,22]. Permutation invariant training (PIT) [23] is usually used to train such a multiple-output models.…”
Section: Introduction (mentioning)
confidence: 99%
“…While promising results were shown, all the previous studies were limited to either simulated data [16,17,18,19,20,21,22,24,25,26,27] or small-scale real data [7,9,8,10]. It is due to the difficulty in collecting real meeting recordings with precise transcriptions at large-scale.…”
Section: Introduction (mentioning)
confidence: 99%