2021 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt48900.2021.9383556

Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis

Cited by 51 publications (34 citation statements)
References 35 publications

“…2. Note that the system named "Joint System 2 (J2)" is a new pipeline that we propose in this paper, while the other three systems are known pipelines for SA-ASR that have been investigated in prior works (such as [13,32]).…”
Section: Modular and Joint Systems for Speaker-Attributed ASR
confidence: 99%
“…There have been many studies on microphone array recordings to improve speech separation [6][7][8], speaker diarization [6,9], and ASR systems [10,11] by using spatial information. On the other hand, SA-ASR based on a single microphone is still highly challenging, and only a limited number of studies have been conducted on fully automatic SA-ASR systems for monaural long-form audio [12][13][14].…”
Section: Introduction
confidence: 99%
“…This is what the automatic system is trying to learn. For many downstream tasks of speech processing, such as speaker diarization [2] and automatic speech recognition [3], speech separation is a necessary pre-processing step.…”
Section: Introduction
confidence: 99%
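
The separation-first structure described in the excerpt above can be written down as a small pipeline. The sketch below is a hypothetical illustration: the names separate_then_process, separator, diarizer, and recognizer are placeholders and do not come from the cited papers.

# A minimal sketch of separation as a pre-processing step: the mixture is
# separated first, and diarization/ASR then run on each estimated stream.
# All callables are hypothetical placeholders.
def separate_then_process(mixture, separator, diarizer, recognizer):
    streams = separator(mixture)  # e.g. one estimated waveform per speaker
    results = []
    for stream in streams:
        results.append({
            "segments": diarizer(stream),      # speaker activity on the separated stream
            "transcript": recognizer(stream),  # ASR on the separated stream
        })
    return results
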
“…Many variants of this approach have been investigated, such as methods using agglomerative hierarchical clustering (AHC) [2], spectral clustering (SC) [3], and variational Bayesian inference [4,5]. While these approaches showed good performance under difficult test conditions [6], they cannot handle overlapped speech [7]. Several extensions were also proposed to handle overlapping speech, such as using overlap detection [8] and speech separation [9].…”
Section: Introduction
confidence: 99%
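
To make the clustering-based approach in the excerpt above concrete, here is a minimal sketch of AHC over per-segment speaker embeddings using SciPy. The random embeddings, cosine distance, average linkage, and threshold value are illustrative assumptions, not the configuration used in the cited systems.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def ahc_diarize(embeddings: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Assign a speaker label to each segment embedding via AHC."""
    dists = pdist(embeddings, metric="cosine")       # pairwise cosine distances
    tree = linkage(dists, method="average")          # average-linkage clustering
    return fcluster(tree, t=threshold, criterion="distance")  # cut the dendrogram

# Illustrative usage with random vectors standing in for real speaker embeddings.
rng = np.random.default_rng(0)
segment_embeddings = rng.normal(size=(20, 128))
labels = ahc_diarize(segment_embeddings)
print(labels)  # one speaker label per segment

Note that each segment receives exactly one label here, which is why, as the excerpt points out, plain clustering cannot represent overlapped speech.
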
“…Target-speaker voice activity detection (TS-VAD) [15] is another approach, in which a neural network is trained to estimate the speech activities of all the speakers specified by a set of pre-estimated speaker embeddings. Of these speaker diarization methods, TS-VAD achieved state-of-the-art (SOTA) results in several diarization tasks [7,15], including recent international competitions [16,17]. On the other hand, TS-VAD has the limitation that the number of recognizable speakers is bounded by the number of output nodes of the model.…”
Section: Introduction
confidence: 99%
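
The TS-VAD idea summarized above, a network that takes mixture features together with pre-estimated speaker embeddings and outputs per-frame speech activity for a fixed set of speakers, can be sketched as follows. The BLSTM/linear structure, layer sizes, and class name are assumptions for illustration and are not the architecture of [15].

import torch
import torch.nn as nn

class TSVADSketch(nn.Module):
    def __init__(self, feat_dim=80, emb_dim=128, hidden=256, max_speakers=4):
        super().__init__()
        self.max_speakers = max_speakers
        # Encode the acoustic features of the mixture.
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Per-speaker detector conditioned on that speaker's embedding.
        self.detector = nn.Sequential(
            nn.Linear(2 * hidden + emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats, spk_embs):
        # feats: (batch, time, feat_dim); spk_embs: (batch, max_speakers, emb_dim)
        enc, _ = self.encoder(feats)  # (batch, time, 2 * hidden)
        activities = []
        for s in range(self.max_speakers):  # one output stream per enrolled speaker
            emb = spk_embs[:, s:s + 1, :].expand(-1, enc.size(1), -1)
            logits = self.detector(torch.cat([enc, emb], dim=-1))
            activities.append(torch.sigmoid(logits))  # per-frame speech activity
        return torch.cat(activities, dim=-1)  # (batch, time, max_speakers)

model = TSVADSketch()
activity = model(torch.randn(2, 100, 80), torch.randn(2, 4, 128))
print(activity.shape)  # torch.Size([2, 100, 4])

The fixed max_speakers output dimension makes the quoted limitation explicit: the model can only track as many speakers as it has output nodes.
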