2023
DOI: 10.1109/taslp.2022.3233237

Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors

Abstract: A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multi-label classification problem. It has also been extended to a flexible number of speakers by introducing speaker-wise attractors. However, the output number of speakers of attractor-based EEND is empirically capped; it cannot deal with cases where the number of speakers appearing …

Cited by 8 publications (6 citation statements)
References 68 publications (147 reference statements)
“…To make the number of output speakers flexible and unlimited, EEND-vector clustering (EEND-VC) [12]-[14] integrates the end-to-end and clustering approaches: it applies an EEND model to short audio blocks and then matches inter-block speaker labels by clustering speaker embeddings. In addition, EEND-GLA [15], [16] calculates local attractors from each short block and finds the speaker correspondence based on similarities between inter-block attractors. As its training only requires relative speaker labels within a recording, EEND-GLA is practical for adapting models to in-the-wild datasets without globally unique speaker labels.…”
Section: B. End-to-End Neural Diarization
confidence: 99%
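
The inter-block correspondence step described in this excerpt can be sketched in a few lines. The snippet below is a minimal illustration, not the exact algorithm of [15], [16]: local attractors from two adjacent blocks are matched by cosine similarity with a Hungarian assignment (function names, dimensions, and the random inputs are placeholders).

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_local_attractors(attr_prev, attr_curr):
        """attr_prev: (S_prev, D) and attr_curr: (S_curr, D) local attractors.
        Returns (prev_idx, curr_idx) pairs assigning current speakers to previous ones."""
        a = attr_prev / np.linalg.norm(attr_prev, axis=1, keepdims=True)
        b = attr_curr / np.linalg.norm(attr_curr, axis=1, keepdims=True)
        sim = a @ b.T                           # (S_prev, S_curr) cosine similarities
        row, col = linear_sum_assignment(-sim)  # maximize total similarity
        return list(zip(row.tolist(), col.tolist()))

    # Example with two 3-speaker blocks and 16-dimensional attractors.
    rng = np.random.default_rng(0)
    print(match_local_attractors(rng.normal(size=(3, 16)), rng.normal(size=(3, 16))))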
“…Nevertheless, permutation-invariant training in EEND-based methods causes performance degradation when the number of speakers increases in long recordings. Although a few studies [12]-[16] have explored unsupervised clustering to address this problem, their results are still unsatisfactory. Recently, Target-Speaker Voice Activity Detection (TS-VAD) approaches [17]-[20] have become attractive; they combine the advantages of modularized methods and end-to-end neural networks.…”
Section: Introduction
confidence: 99%
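
For context on the permutation issue raised in this excerpt, a brute-force permutation-invariant training loss can be written as below. This is a generic illustration (numpy, frame-averaged binary cross-entropy), not the objective of any specific cited system, and it only scales to a handful of speakers.

    import itertools
    import numpy as np

    def pit_bce(pred, label, eps=1e-7):
        """pred, label: (T, S) per-frame speech activities in [0, 1].
        Returns the binary cross-entropy under the best speaker permutation."""
        n_spk = label.shape[1]
        best = np.inf
        for perm in itertools.permutations(range(n_spk)):
            p = np.clip(pred[:, list(perm)], eps, 1 - eps)
            bce = -(label * np.log(p) + (1 - label) * np.log(1 - p)).mean()
            best = min(best, bce)
        return best

    # A column-swapped prediction still scores near zero, showing the order-invariance.
    label = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)
    print(pit_bce(label[:, ::-1], label))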
“…Another approach to tackle the issue of speaker permutation ambiguity involves maintaining acoustic features within a speaker-tracing buffer [26,30]. On top of this approach, Horiguchi et al. [31] enhanced EEND-GLA for online usage by integrating a speaker-tracing buffer. Remarkably, this method achieved SOTA performance across various datasets, outperforming numerous offline modularized speaker diarization systems.…”
Section: B
confidence: 99%
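
The speaker-tracing-buffer idea mentioned here can be pictured with the rough bookkeeping below. The class name, the fixed speaker dimension, and the uniform subsampling rule are simplifying assumptions; the selection strategies in [26,30] and in Horiguchi et al. [31] are more elaborate.

    import numpy as np

    class SpeakerTracingBuffer:
        """Keeps a bounded set of past frames and their diarization results so that
        each new block can be processed together with frames of already-seen speakers."""

        def __init__(self, max_frames=500):
            self.max_frames = max_frames
            self.feats = None    # (n, D) buffered acoustic features
            self.labels = None   # (n, S) buffered diarization results (fixed S assumed)

        def concat_with(self, block_feats):
            # Features fed to the diarization model: buffer first, then the new block.
            return block_feats if self.feats is None else np.vstack([self.feats, block_feats])

        def update(self, feats, labels):
            # Append the newest frames; subsample uniformly if the buffer overflows.
            if self.feats is None:
                self.feats, self.labels = feats, labels
            else:
                self.feats = np.vstack([self.feats, feats])
                self.labels = np.vstack([self.labels, labels])
            if len(self.feats) > self.max_frames:
                keep = np.linspace(0, len(self.feats) - 1, self.max_frames).astype(int)
                self.feats, self.labels = self.feats[keep], self.labels[keep]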
“…During the training stage, this inconsistency can be solved using a permutation-free objective [20,29], but this makes it hard to extend EEND for online use, since the order of speakers keeps changing as new signal arrives. One solution is to employ a buffer that stores both prior inputs and results so that the current results can be aligned with previous ones [26,30,31], but this usually requires a large buffer for satisfactory performance.…”
Section: Introduction
confidence: 99%
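
The alignment that such a buffer enables can be illustrated as a search over speaker-column permutations on the overlapping (buffered) frames. This toy version assumes the model is re-run on the buffer plus the new block; the shapes and names are invented for the example and do not reproduce the exact procedure of [26,30,31].

    import itertools
    import numpy as np

    def align_to_buffer(buffered_labels, new_output):
        """buffered_labels: (n, S) stored results; new_output: (n + t, S) fresh output on
        the buffered frames followed by the new block. Returns new_output with its
        speaker columns reordered to agree with the buffer."""
        n, n_spk = buffered_labels.shape
        overlap = new_output[:n]
        best_perm, best_err = list(range(n_spk)), np.inf
        for perm in itertools.permutations(range(n_spk)):
            err = np.abs(overlap[:, list(perm)] - buffered_labels).mean()
            if err < best_err:
                best_perm, best_err = list(perm), err
        return new_output[:, best_perm]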
“…Additionally, Horiguchi et al. [21] introduced the EEND-EDA system, which utilizes an LSTM encoder-decoder network to model attractors for each speaker. Furthermore, researchers have also proposed two-stage hybrid systems [22], [23] to address the challenge of handling a flexible number of speakers. These systems first output diarization results for short segments with a limited number of speakers using EEND, and then employ a clustering algorithm to solve the inter-segment speaker permutation problem.…”
Section: Introduction
confidence: 99%
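
The encoder-decoder attractor (EDA) mechanism referred to in this excerpt can be schematized as follows. Layer sizes, the zero-vector decoder input, and the 0.5 stopping threshold are illustrative assumptions rather than the exact configuration of the system in [21].

    import torch
    import torch.nn as nn

    class AttractorDecoder(nn.Module):
        def __init__(self, dim=256, max_speakers=10):
            super().__init__()
            self.encoder = nn.LSTM(dim, dim, batch_first=True)   # summarizes frame embeddings
            self.decoder = nn.LSTM(dim, dim, batch_first=True)   # emits one attractor per step
            self.exist = nn.Linear(dim, 1)                       # attractor existence logit
            self.max_speakers = max_speakers

        def forward(self, emb, threshold=0.5):
            """emb: (1, T, dim) frame embeddings -> (n_spk, dim) attractors."""
            _, state = self.encoder(emb)
            zeros = emb.new_zeros(1, 1, emb.size(-1))
            attractors = []
            for _ in range(self.max_speakers):
                out, state = self.decoder(zeros, state)
                if torch.sigmoid(self.exist(out)).item() < threshold:
                    break                                        # no further speakers
                attractors.append(out.squeeze(0).squeeze(0))
            return torch.stack(attractors) if attractors else emb.new_zeros(0, emb.size(-1))

    # Example: 200 frames of 256-dimensional embeddings (untrained weights, so the
    # number of decoded attractors is arbitrary here).
    print(AttractorDecoder()(torch.randn(1, 200, 256)).shape)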