Interspeech 2021
DOI: 10.21437/interspeech.2021-560

End-To-End Speaker Segmentation for Overlap-Aware Resegmentation

Abstract: Speaker segmentation consists of partitioning a conversation between one or more speakers into speaker turns. This task is usually addressed as the late combination of three sub-tasks (voice activity detection, speaker change detection, and overlapped speech detection); we propose instead to train an end-to-end segmentation model that performs it directly. Inspired by the original end-to-end neural speaker diarization approach (EEND), the task is modeled as a multi-label classification problem using permutation-invariant training. The…

Cited by 57 publications (38 citation statements)
References 13 publications

“…The combination of an end-to-end approach and clustering is a promising direction to solve the problem of the limitation of the number of speakers. For example, EEND as postprocessing [23] and overlap-aware resegmentation [13] use EEND to refine the results obtained with cascaded diarization systems. The initial results are based on clustering of speaker embeddings; hence, the number of output speakers can be arbitrary.…”
Section: Related Work; citation type: mentioning
confidence: 99%
“…For evaluating offline diarization, we utilized several cascaded methods [13], [14], [22], [69] and end-to-end methods [16], [17], [29], [32] for comparison. For evaluating online diarization, we used FW-STB with EEND-EDA based on four-stacked Transformers [25].…”
Section: Experimental Settings; citation type: mentioning
confidence: 99%
“…We used a simple speaker diarization pipeline including the following steps: voice activity detection (VAD), overlapped speech detection (OSD; both from [38,39]), fixed-length segmentation, clustering, and post-processing. The post-processing includes merging the adjacent sub-segments from the same speaker and distributing the overlapped segments equally among the adjacent segments with different speakers.…”
Section: Experiments Setup; citation type: mentioning
confidence: 99%
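
The merging step mentioned in the statement above can be illustrated with a minimal sketch. It is not the citing authors' implementation: the function name `merge_adjacent`, the `(start, end, speaker)` tuple layout, and the `max_gap` tolerance are all assumptions made for illustration, and the redistribution of overlapped segments is not covered.

```python
def merge_adjacent(segments, max_gap=0.0):
    """Merge consecutive segments that share a speaker label.

    segments: iterable of (start, end, speaker) tuples, times in seconds.
    max_gap:  hypothetical tolerance; two same-speaker segments are merged
              when the silence between them is at most this long.
    """
    merged = []
    for start, end, speaker in sorted(segments):
        same_speaker = bool(merged) and merged[-1][2] == speaker
        if same_speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            # Extend the previous segment instead of starting a new one.
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


# Four fixed-length sub-segments of 1.5 s; the first two belong to speaker "A".
sub_segments = [(0.0, 1.5, "A"), (1.5, 3.0, "A"), (3.0, 4.5, "B"), (4.5, 6.0, "A")]
print(merge_adjacent(sub_segments))
```

Only directly consecutive segments are merged, so the final "A" segment stays separate from the first one because a "B" segment lies between them.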
“…We also compare to the neural speaker segmentation method implemented in pyannote.audio [27] that performs joint voice activity detection, speaker segmentation and overlapped speech detection. Similarly to the original EEND approach [28], here speaker segmentation is modeled as a multi-label classification problem using permutation-invariant training.…”
Section: Baselines; citation type: mentioning
confidence: 99%
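
Permutation-invariant training for multi-label speaker activity, as referred to in the statement above, can be sketched as follows. This is a brute-force, pure-Python illustration, not the pyannote.audio or EEND code: the function names `bce` and `pit_bce_loss` and the nested-list input layout are assumptions. The idea is that the order of the model's output channels is arbitrary, so the loss is computed under every assignment of output channels to reference speakers and the lowest value is kept.

```python
import math
from itertools import permutations


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def pit_bce_loss(pred, target):
    """Return (loss, permutation) minimizing the mean frame-level BCE.

    pred, target: lists of frames; each frame is a list with one value per
    speaker (probabilities in pred, 0/1 activity labels in target).
    permutation[s] is the output channel assigned to reference speaker s.
    """
    num_frames, num_speakers = len(pred), len(pred[0])
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(num_speakers)):
        total = sum(
            bce(pred[t][perm[s]], target[t][s])
            for t in range(num_frames)
            for s in range(num_speakers)
        )
        loss = total / (num_frames * num_speakers)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two speakers, three frames; the middle frame is overlapped speech.
target = [[1, 0], [1, 1], [0, 1]]
# The predictions are accurate but the two channels are swapped.
pred = [[0.1, 0.9], [0.9, 0.8], [0.9, 0.2]]
loss, perm = pit_bce_loss(pred, target)
print(perm, round(loss, 3))
```

Enumerating all permutations costs a factorial number of evaluations in the number of speakers, which is only practical for a small number of output channels; larger settings typically rely on an assignment solver instead.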