2022
DOI: 10.1016/j.csl.2021.101254

Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks

Cited by 100 publications (84 citation statements)
References 15 publications
“…Before decoding with TS-VAD, we need an initial diarization result to get each speaker's segments for extracting corresponding i-vectors. M2MeT baseline [9] provides AHC with Variational Bayesian HMM clustering (VBx) [13]. First, for the speaker embedding network, we replace the baseline ResNet with ECAPA-TDNN (C=512) [14].…”
Section: Clustering-based Speaker Diarization (mentioning)
confidence: 99%
“…Within-show speaker diarization is a very active field of research in which deep learning approaches have recently reached the performance of more classic methods based on Hierarchical Agglomerative Clustering (HAC) [1], K-Means or Spectral Clustering [11], or variational Bayesian modeling [12]. Recent neural approaches have shown tremendous improvement for audio recordings involving a limited number of speakers [13][14][15][16]; however, the inherent difficulty of speaker permutation, often addressed using a PIT (permutation invariant training) loss, does not allow current neural end-to-end systems to perform as well as HAC-based approaches when dealing with a large number of speakers per audio file (>7), as explained in [17].…”
Section: Related Work (mentioning)
confidence: 99%
“…E³FS³ will include a diarization tool based on the VBx algorithm, which had the best performance in the DIHARD’19 diarization challenge [6], [7], [8], [9]; however, all data that were used for training and validation in the context of the present paper were supplied already diarized, so this part of the system was not validated as part of the E³FS³ α validation.…”
Section: E³FS³ Core Software... (mentioning)
confidence: 99%