2022
DOI: 10.48550/arxiv.2202.00842
Preprint
Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Abstract: This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output layers, the t-SOT model has only a single output layer that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates the change of "virtual" output channels is introduced to keep track of the…
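The serialization scheme the abstract describes can be sketched as follows: tokens from overlapping speakers are interleaved by emission time, and a channel-change token is emitted whenever the virtual output channel switches. This is a minimal illustration only; the function name, the `"<cc>"` token string, and the tuple layout are assumptions, not the paper's actual implementation.

```python
# Sketch of t-SOT-style token serialization (assumptions: word-level
# emission times are available, and "<cc>" marks a virtual channel change).
def serialize_tsot(events):
    """events: list of (emission_time, channel, token) tuples."""
    out, current = [], None
    # Sort all tokens from all speakers by their emission time.
    for _, chan, tok in sorted(events, key=lambda e: e[0]):
        # Emit the channel-change token whenever the speaker channel switches.
        if current is not None and chan != current:
            out.append("<cc>")
        current = chan
        out.append(tok)
    return out

# Two overlapping speakers merged into a single chronological token stream:
spk0 = [(0.0, 0, "hello"), (0.5, 0, "world")]
spk1 = [(0.3, 1, "good"), (0.8, 1, "morning")]
print(serialize_tsot(spk0 + spk1))
# -> ['hello', '<cc>', 'good', '<cc>', 'world', '<cc>', 'morning']
```

Because a single chronological stream with channel markers is sufficient to reconstruct each speaker's transcript, this keeps the model architecture identical to a single-talker ASR model with one output layer, which is the simplification the abstract highlights.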

Cited by 1 publication (10 citation statements)
References 32 publications (59 reference statements)
“…1. The t-SOT model was found to significantly outperform prior multitalker ASR models in both the recognition accuracy and latency while keeping the model architecture as simple as conventional single-talker ASR models [29].…”
Section: T-SOT
confidence: 96%
“…The t-SOT framework was recently proposed to recognize multi-talker conversations with low latency [29]. In t-SOT, we assume up to M utterances are overlapping at the same time in the input audio.…”
Section: T-SOT
confidence: 99%