Online Speaker Diarization with Core Samples Selection

Yue, Yanyan; Du, Jie; He, Maokui; Yeung, YuTing; Wang, Renyu

doi:10.21437/interspeech.2022-10363

Cited by 7 publications

(3 citation statements)

References 0 publications


“…For evaluating online diarization, we used FW-STB with EEND-EDA based on four-stacked Transformers [26]. In addition, we referred to the results of various conventional online diarization methods [24], [26], [50]-[52], [70], [71] on various datasets. Some cascaded comparison methods [50], [51], [70] used the oracle SAD; for a fair comparison, we used SAD post-processing The values are from the original FW-STB paper [26].…”

Section: Experimental Settings (mentioning)

confidence: 99%

Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors

Horiguchi, Watanabe, García, et al. 2023

IEEE/ACM Trans. Audio Speech Lang. Process.


A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multilabel classification problem. It has also been extended for a flexible number of speakers by introducing speaker-wise attractors. However, the output number of speakers of attractor-based EEND is empirically capped; it cannot deal with cases where the number of speakers appearing during inference is higher than that during training because its speaker counting is trained in a fully supervised manner. Our method, EEND-GLA, solves this problem by introducing unsupervised clustering into attractor-based EEND. In the method, the input audio is first divided into short blocks, then attractor-based diarization is performed for each block, and finally, the results of each block are clustered on the basis of the similarity between locally calculated attractors. While the number of output speakers is limited within each block, the total number of speakers estimated for the entire input can be higher than the limitation. To use EEND-GLA in an online manner, our method also extends the speaker-tracing buffer, which was originally proposed to enable online inference of conventional EEND. We introduce a blockwise buffer update to make the speaker-tracing buffer compatible with EEND-GLA. Finally, to improve online diarization, our method improves the buffer update method and revisits the variable chunk-size training of EEND. The experimental results demonstrate that EEND-GLA can perform speaker diarization of an unseen number of speakers in both offline and online inferences.
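
The block-then-cluster idea in this abstract can be illustrated with a minimal sketch. It is not the EEND-GLA implementation: the function name `cluster_local_attractors`, the cosine-similarity threshold of 0.7, and the greedy centroid matching are all assumptions chosen for brevity, and the rule that two attractors from the same block never share a speaker is likewise an assumption of this sketch. Attractors are assumed to be non-zero vectors.

```python
import numpy as np

def cluster_local_attractors(blocks, threshold=0.7):
    """Greedily assign a global speaker id to each per-block attractor.

    blocks: list of arrays, each of shape (n_local_speakers, dim).
    Returns (labels, n_speakers), where labels[i][j] is the global id
    of attractor j in block i. Hypothetical helper, not from the paper.
    """
    centroids = []  # running sum of unit-normalised attractors per global speaker
    labels = []
    for block in blocks:
        block = np.asarray(block, dtype=float)
        unit = block / np.linalg.norm(block, axis=1, keepdims=True)
        block_labels = []
        for vec in unit:
            best, best_sim = -1, threshold
            for k, c in enumerate(centroids):
                if k in block_labels:
                    continue  # assumed: attractors in one block are distinct speakers
                sim = float(vec @ c / np.linalg.norm(c))  # cosine similarity
                if sim > best_sim:
                    best, best_sim = k, sim
            if best < 0:
                # No sufficiently similar global speaker: open a new one.
                centroids.append(vec.copy())
                best = len(centroids) - 1
            else:
                centroids[best] += vec
            block_labels.append(best)
        labels.append(block_labels)
    return labels, len(centroids)

# Two blocks, each limited to two local speakers, three speakers overall.
blocks = [
    np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    np.array([[0.0, 0.0, 1.0], [1.0, 0.05, 0.0]]),
]
labels, n_speakers = cluster_local_attractors(blocks)
print(labels, n_speakers)
```

In this toy input each block holds at most two attractors, yet the clustering recovers three global speakers, which mirrors the abstract's point that the total can exceed the per-block limit.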


“…Speaker diarisation (SD), which segments input audio to short utterances according to speaker identity, is going through a rapid breakthrough [1,2]. Based on the success of recent SD systems [3][4][5][6][7][8][9][10][11][12], online SD systems are also being developed [13][14][15][16][17][18][19][20]. In an online SD system, the system should decide the speaker label of a given short segment leveraging only current and past segments, where only a part of past segments are available.…”

Section: Introduction (mentioning)

confidence: 99%

“…In [15], authors modified the agglomerative hierarchical clustering (AHC) algorithm, widely adopted in offline SD systems, and proposed a checkpoint AHC with the label matching algorithm. Authors of [16] adopted a memory module for each speaker and contained selected embeddings, where VBx [9] and cosine operations on centroids were used for clustering. Wang et al [17] adapted target speaker voice activity detection (TS-VAD), a successful offline SD framework, to online SD scenarios [8,22].…”

Section: Introduction (mentioning)

confidence: 99%

Absolute decision corrupts absolutely: conservative online speaker diarisation

Kwon, Heo, Lee, et al. 2022

Preprint


Our focus lies in developing an online speaker diarisation framework which demonstrates robust performance across diverse domains. In online speaker diarisation, outputs generated in real time are irreversible, and a few misjudgements in the early phase of an input session can lead to catastrophic results. We hypothesise that cautiously increasing the number of estimated speakers is of paramount importance among many other factors. Thus, our proposed framework includes decreasing the number of speakers by one when the system judges that an increase in the past was faulty. We also adopt dual buffers, checkpoints and centroids, where checkpoints are combined with silhouette coefficients to estimate the number of speakers and centroids represent speakers. Again, we believe that more than one centroid can be generated from one speaker. Thus we design a clustering-based label matching technique to assign labels in real time. The resulting system is lightweight yet surprisingly effective. The system demonstrates state-of-the-art performance on DIHARD II and III datasets, where it is also competitive in AMI and VoxConverse test sets.
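
As a rough illustration of how silhouette coefficients can drive speaker counting, the sketch below scores candidate speaker counts on a small buffer of embeddings and keeps the best one. It is not the paper's algorithm: the naive average-linkage clustering, the Euclidean distance, and the names `silhouette`, `average_linkage_labels` and `estimate_num_speakers` are assumptions made for this example. Because the silhouette coefficient is undefined for a single cluster, the search starts at two speakers and needs at least three points.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: silhouette is taken as 0
        a = dist[i, same].sum() / (n_same - 1)  # mean intra-cluster distance
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

def average_linkage_labels(X, k):
    """Naive agglomerative clustering (average linkage) down to k clusters."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

def estimate_num_speakers(X, max_speakers=4):
    """Return the candidate count with the highest silhouette, plus all scores."""
    X = np.asarray(X, dtype=float)
    upper = min(max_speakers, len(X) - 1)
    scores = {k: silhouette(X, average_linkage_labels(X, k))
              for k in range(2, upper + 1)}
    return max(scores, key=scores.get), scores

# Toy embeddings: three tight groups standing in for three speakers.
X = [[0, 0], [0.1, 0], [0, 0.1],
     [10, 0], [10.1, 0], [10, 0.1],
     [0, 10], [0.1, 10], [0, 10.1]]
k, scores = estimate_num_speakers(X)
print(k, scores)
```

On these well-separated toy points the three-cluster solution scores highest. Real embeddings are far noisier, which is the situation the abstract's cautious handling of the speaker count is meant to address.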
