2022
DOI: 10.48550/arxiv.2202.00842
Preprint
Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Abstract: This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output layers, the t-SOT model has only a single output layer that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates the change of "virtual" output channels is introduced to keep track of the…
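The serialization scheme the abstract describes can be sketched as follows: tokens from overlapping speakers are interleaved by emission time, and a channel-change token is emitted whenever the virtual output channel switches. This is a minimal illustration only; the function name, the `"<cc>"` token string, and the tuple layout are assumptions, not the paper's actual implementation.

```python
# Sketch of t-SOT-style token serialization (assumptions: word-level
# emission times are available, and "<cc>" marks a virtual channel change).
def serialize_tsot(events):
    """events: list of (emission_time, channel, token) tuples."""
    out, current = [], None
    # Sort all tokens from all speakers by their emission time.
    for _, chan, tok in sorted(events, key=lambda e: e[0]):
        # Emit the channel-change token whenever the speaker channel switches.
        if current is not None and chan != current:
            out.append("<cc>")
        current = chan
        out.append(tok)
    return out

# Two overlapping speakers merged into a single chronological token stream:
spk0 = [(0.0, 0, "hello"), (0.5, 0, "world")]
spk1 = [(0.3, 1, "good"), (0.8, 1, "morning")]
print(serialize_tsot(spk0 + spk1))
# -> ['hello', '<cc>', 'good', '<cc>', 'world', '<cc>', 'morning']
```

Because a single chronological stream with channel markers is sufficient to reconstruct each speaker's transcript, this keeps the model architecture identical to a single-talker ASR model with one output layer, which is the simplification the abstract highlights.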

Cited by 1 publication (10 citation statements)
References 32 publications (59 reference statements)
“…1. The t-SOT model was found to significantly outperform prior multitalker ASR models in both the recognition accuracy and latency while keeping the model architecture as simple as conventional single-talker ASR models [29].…”
Section: T-SOT
confidence: 96%
“…The t-SOT framework was recently proposed to recognize multi-talker conversations with low latency [29]. In t-SOT, we assume up to M utterances are overlapping at the same time in the input audio.…”
Section: T-SOT
confidence: 99%