End-To-End Speaker Segmentation for Overlap-Aware Resegmentation

Bredin, Hervé; Laurent, Antoine

doi:10.21437/interspeech.2021-560

Cited by 57 publications

(38 citation statements)

References 13 publications


“…The combination of an end-to-end approach and clustering is a promising direction to solve the problem of the limitation of the number of speakers. For example, EEND as postprocessing [23] and overlap-aware resegmentation [13] use EEND to refine the results obtained with cascaded diarization systems. The initial results are based on clustering of speaker embeddings; hence, the number of output speakers can be arbitrary.…”

Section: Related Work (mentioning)

confidence: 99%

“…For evaluating offline diarization, we utilized several cascaded methods [13], [14], [22], [69] and end-to-end methods [16], [17], [29], [32] for comparison. For evaluating online diarization, we used FW-STB with EEND-EDA based on four-stacked Transformers [25].…”

Section: Experimental Settings (mentioning)

confidence: 99%

“…In offline diarization, EEND-GLA-Small and EEND-GLA-Large improved the DERs from EEND-EDA, especially when the number of speakers was higher than four. Compared with the cascaded method [72] or the cascaded method incorporated with EEND for post-processing [13], EEND-GLA-Large performed on par with them when the number of speakers was low, but not when the number of speakers was high.…” (Footnote § in the citing paper: “The values are from the original FW-STB paper [25].”)

Section: A Evaluation Of the Variations Of Speaker-tracing Buffer (mentioning)

confidence: 99%


Online Neural Diarization of Unlimited Numbers of Speakers

Horiguchi, Watanabe, Garcia et al. 2022

Preprint


“…We used a simple speaker diarization pipeline including the following steps: voice activity detection (VAD), overlapped speech detection (OSD; both from [38,39]), fixed-length segmentation, clustering, and post-processing. The post-processing includes merging the adjacent sub-segments from the same speaker and distributing the overlapped segments equally among the adjacent segments with different speakers.…”

Section: Experiments Setup (mentioning)

confidence: 99%
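
The merging step of the post-processing quoted above can be illustrated with a minimal sketch. It assumes that segments are (start, end, speaker) tuples in seconds; the function name `merge_adjacent` and the `max_gap` parameter are hypothetical and do not come from the citing paper. The redistribution of overlapped segments is not shown, because the quote does not describe it in enough detail to reproduce.

```python
from typing import List, Tuple

# Hypothetical representation: (start_seconds, end_seconds, speaker_label)
Segment = Tuple[float, float, str]


def merge_adjacent(segments: List[Segment], max_gap: float = 0.0) -> List[Segment]:
    """Merge consecutive segments that share a speaker label.

    Two segments are merged when they come from the same speaker and the
    silence between them is at most `max_gap` seconds (an assumed knob).
    """
    merged: List[Segment] = []
    for start, end, speaker in sorted(segments):
        same_speaker = bool(merged) and merged[-1][2] == speaker
        if same_speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            # Extend the previous segment instead of opening a new one.
            merged[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            merged.append((start, end, speaker))
    return merged


fixed_length = [
    (0.0, 1.5, "A"),
    (1.5, 3.0, "A"),
    (3.0, 4.5, "B"),
    (4.5, 6.0, "A"),
]
result = merge_adjacent(fixed_length)
```

Here the two leading sub-segments of speaker "A" collapse into one segment, while the later "A" segment stays separate because a "B" segment lies in between.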

Magnitude-aware Probabilistic Speaker Embeddings

Kuzmin, Fedorov, Sholokhov 2022

Preprint

Recently, hyperspherical embeddings have established themselves as a dominant technique for face and voice recognition. Specifically, Euclidean space vector embeddings are learned to encode person-specific information in their direction while ignoring the magnitude. However, recent studies have shown that the magnitudes of the embeddings extracted by deep neural networks may indicate the quality of the corresponding inputs. This paper explores the properties of the magnitudes of the embeddings related to quality assessment and out-of-distribution detection. We propose a new probabilistic speaker embedding extractor using the information encoded in the embedding magnitude and leverage it in the speaker verification pipeline. We also propose several quality-aware diarization methods and incorporate the magnitudes in those. Our results indicate significant improvements over magnitude-agnostic baselines both in speaker verification and diarization tasks.


“…We also compare to the neural speaker segmentation method implemented in pyannote.audio [27] that performs joint voice activity detection, speaker segmentation and overlapped speech detection. Similarly to the original EEND approach [28], here speaker segmentation is modeled as a multi-label classification problem using permutation-invariant training.…”

Section: Baselines (mentioning)

confidence: 99%
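
The idea of treating speaker segmentation as multi-label classification trained with permutation-invariant training can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the pyannote.audio or EEND implementation: the function names `bce` and `pit_bce_loss` are hypothetical, the loss is a frame-wise binary cross-entropy, and the best speaker permutation is found by brute force, which is only practical for a small number of speakers.

```python
import math
from itertools import permutations
from typing import Sequence, Tuple


def bce(p: float, y: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for a single probability/label pair."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def pit_bce_loss(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
) -> Tuple[float, Tuple[int, ...]]:
    """Return the lowest mean BCE over all speaker permutations.

    `pred` and `target` are (frames x speakers) matrices. Output channel s
    is compared with reference speaker perm[s].
    """
    num_frames = len(target)
    num_speakers = len(target[0])
    best_loss, best_perm = math.inf, tuple(range(num_speakers))
    for perm in permutations(range(num_speakers)):
        total = sum(
            bce(pred[t][s], target[t][perm[s]])
            for t in range(num_frames)
            for s in range(num_speakers)
        )
        loss = total / (num_frames * num_speakers)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Reference activity: speaker 0 alone, both speakers (overlap), speaker 1 alone.
target = [[1, 0], [1, 1], [0, 1]]
# Predictions are accurate but the two output channels are swapped.
pred = [[0.1, 0.9], [0.9, 0.9], [0.9, 0.1]]
loss, perm = pit_bce_loss(pred, target)
```

Because several speakers can be active in the same frame, the overlap frame in the middle is an ordinary multi-label target, and the permutation search makes the loss insensitive to the order of the output channels.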

Collar-aware Training for Streaming Speaker Change Detection in Broadcast Speech

Kalda, Alumäe 2022

Preprint

In this paper, we present a novel training method for speaker change detection models. Speaker change detection is often viewed as a binary sequence labelling problem. The main challenges with this approach are the vagueness of annotated change points caused by the silences between speaker turns and imbalanced data due to the majority of frames not including a speaker change. Conventional training methods tackle these by artificially increasing the proportion of positive labels in the training data. Instead, the proposed method uses an objective function which encourages the model to predict a single positive label within a specified collar. This is done by marginalizing over all possible subsequences that have exactly one positive label within the collar. Experiments on English and Estonian datasets show large improvements over the conventional training method. Additionally, the model outputs have peaks concentrated to a single frame, removing the need for postprocessing to find the exact predicted change point which is particularly useful for streaming applications.
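
The marginalization described in the abstract above can be illustrated with a small sketch. It assumes that the model emits an independent change probability for each frame inside the collar; under that assumption, the probability of exactly one positive label is the sum, over each frame, of that frame being positive and all others negative. The function names are hypothetical, and the paper's actual objective may differ in detail.

```python
import math
from typing import Sequence


def prob_exactly_one_positive(probs: Sequence[float]) -> float:
    """Probability that exactly one frame in the collar is labelled positive.

    Assumes independent per-frame Bernoulli outputs (an illustrative
    simplification): sum_i p_i * prod_{j != i} (1 - p_j).
    """
    total = 0.0
    for i, p_i in enumerate(probs):
        term = p_i
        for j, p_j in enumerate(probs):
            if j != i:
                term *= 1.0 - p_j
        total += term
    return total


def collar_nll(probs: Sequence[float], eps: float = 1e-12) -> float:
    """Negative log-likelihood of 'exactly one change point in the collar'."""
    return -math.log(max(prob_exactly_one_positive(probs), eps))


# A single sharp peak inside the collar is rewarded with a low loss ...
peaked = collar_nll([0.01, 0.98, 0.01])
# ... whereas probability mass smeared over several frames is penalized.
smeared = collar_nll([0.6, 0.6, 0.6])
```

With this kind of objective, a lower loss is obtained by concentrating the prediction on one frame rather than on a particular annotated frame, which is consistent with the abstract's remark that the outputs peak at a single frame.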

