Multi-Scale Speaker Diarization with Neural Affinity Score Fusion

Park, Tae‐Jin; Kumar, Manoj; Narayanan, Shrikanth

doi:10.1109/icassp39728.2021.9414578

Cited by 7 publications

(2 citation statements)

References 17 publications

(21 reference statements)

“…To address this, one line of research has combined embeddings extracted with different window sizes and shifts. In [Park et al, 2021], the affinity matrices of three different configurations are processed with an NN and the resulting matrix is used for spectral clustering. In [Kwon et al, 2022], embeddings from different resolutions are paired with vectors that denote which scale they were extracted with (analogously to positional encoding in Transformer models) and they are processed with an attention mechanism to obtain similarities between the embeddings (through the inherent attention weights).…”

Section: Neural Network For Speaker Embeddings and Affinity Matrices (mentioning)

confidence: 99%

From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization

Landini, Lozano-Díez, Díez et al. 2022

Interspeech 2022

End-to-end diarization presents an attractive alternative to standard cascaded diarization systems because a single system can handle all aspects of the task at once. Many flavors of end-to-end models have been proposed but all of them require (so far non-existing) large amounts of annotated data for training. The compromise solution consists in generating synthetic data and the recently proposed simulated conversations (SC) have shown remarkable improvements over the original simulated mixtures (SM). In this work, we create SC with multiple speakers per conversation and show that they allow for substantially better performance than SM, also reducing the dependence on a fine-tuning stage. We also create SC with wide-band public audio sources and present an analysis on several evaluation sets. Together with this publication, we release the recipes for generating such data and models trained on public sets as well as the implementation to efficiently handle multiple speakers per conversation and an auxiliary voice activity detection loss.

“…These systems involve a dedicated model trained to detect the exact moment when speakers change. To deal with a trade-off between long and short segment lengths, a group of works employs multi-scale segmentation [6,7]. They use multiple scales (segment lengths) and fuse the similarity scores between embeddings obtained from the results of each scale.…”
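The score fusion described in this statement can be illustrated with a minimal sketch. It assumes cosine similarity as the affinity, nearest-center alignment of coarser segments to the finest scale, and fixed fusion weights. None of these details are given in the quoted text, and the function names, window sizes, and weights are hypothetical. In Park et al. the fusion is performed by a neural network rather than by fixed weights.

```python
import numpy as np

def cosine_affinity(emb):
    """Pairwise cosine similarity between row vectors."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ normed.T

def map_to_base(centers, base_centers):
    """For each base-scale segment, index of the nearest segment at another scale."""
    return np.abs(base_centers[:, None] - centers[None, :]).argmin(axis=1)

def fuse_affinities(embs, centers, weights):
    """Weighted sum of per-scale affinities, aligned to the finest scale (last entry)."""
    base_centers = centers[-1]
    fused = np.zeros((len(base_centers), len(base_centers)))
    for emb, ctr, w in zip(embs, centers, weights):
        idx = map_to_base(ctr, base_centers)
        aff = cosine_affinity(emb)
        # Expand the coarse affinity onto the base-scale grid before adding it.
        fused += w * aff[np.ix_(idx, idx)]
    return fused

rng = np.random.default_rng(0)
# Hypothetical segment centers (seconds) for 1.5 s, 1.0 s and 0.5 s windows.
centers = [np.arange(0.75, 10, 0.75), np.arange(0.5, 10, 0.5), np.arange(0.25, 10, 0.25)]
# Random vectors stand in for real speaker embeddings (assumption).
embs = [rng.normal(size=(len(c), 16)) for c in centers]
weights = np.array([0.4, 0.3, 0.3])  # assumed fixed weights summing to 1
fused = fuse_affinities(embs, centers, weights)
```

The fused matrix is symmetric and has one row per finest-scale segment, so it could be passed to a clustering step that accepts a precomputed affinity.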

Section: Introduction and Related Work (mentioning)

confidence: 99%

Building a Speaker Diarization System: Lessons from VoxSRC 2023

Karamyan, Kirakosyan 2023

MPCS

Speaker diarization is the process of partitioning an audio recording into segments corresponding to individual speakers. In this paper, we present a robust speaker diarization system and describe its architecture. We focus on discussing the key components necessary for building a strong diarization system, such as voice activity detection (VAD), speaker embedding, and clustering. Our system emerged as the winner in the Voxceleb Speaker Recognition Challenge (VoxSRC) 2023, a widely recognized competition for evaluating speaker diarization systems.
