Adapting Speaker Embeddings for Speaker Diarisation

Kwon, Youngki; Jung, Jee-weon; Heo, Hee-Soo; Kim, You Jin; Lee, Bong-Jin; Chung, Joon Son

doi:10.21437/interspeech.2021-448

Cited by 10 publications

(4 citation statements)

References 0 publications

“…Therefore, instead of tuning the threshold for each domain data, we adopt clustering with a silhouette coefficient trick. Some studies [10,11,24,25] already composed their clustering-based SD systems using silhouette coefficient, and those systems show superior performance on various datasets without threshold tuning.…”

Section: Initial Clustering Phase (mentioning)

confidence: 99%
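
The quoted statement says the number of speakers is chosen with a silhouette coefficient rather than a tuned threshold, but gives no details. The following is a minimal sketch of that general idea, not the cited authors' method: the average-linkage clustering, the Euclidean distance, and the names `silhouette_score`, `agglomerative_labels` and `estimate_num_speakers` are all assumptions made for illustration.

```python
import math
from itertools import combinations


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def agglomerative_labels(points, num_clusters):
    """Average-linkage agglomerative clustering (assumed here for illustration)."""
    clusters = [[i] for i in range(len(points))]

    def linkage(pair):
        x, y = pair
        dists = [math.dist(points[i], points[j]) for i in clusters[x] for j in clusters[y]]
        return sum(dists) / len(dists)

    while len(clusters) > num_clusters:
        x, y = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[x] += clusters[y]  # x < y, so deleting y keeps x valid
        del clusters[y]
    labels = [0] * len(points)
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels


def estimate_num_speakers(points, max_speakers):
    """Pick the cluster count (2..max_speakers) with the highest mean silhouette."""
    scores = {
        k: silhouette_score(points, agglomerative_labels(points, k))
        for k in range(2, max_speakers + 1)
    }
    return max(scores, key=scores.get), scores


# Toy 2-D "embeddings": three well-separated groups.
toy = [(0, 0), (0.1, 0), (0, 0.1),
       (5, 5), (5.1, 5), (5, 5.1),
       (10, 0), (10.1, 0), (10, 0.1)]
best_k, all_scores = estimate_num_speakers(toy, max_speakers=5)
```

Because the score is computed from the data itself, no distance threshold needs to be tuned per domain, which is the property the statement highlights. Note that this sketch cannot return a single-speaker result, since the silhouette coefficient is undefined for one cluster.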

“…Then we extract speaker embeddings using a sliding window with a 1.5s window and a 0.5s shift. We utilise the H / ASP architecture [31] as our model and prepare the model under the training protocol described in [10].…”

Section: Implementation Details (mentioning)

confidence: 99%
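
The 1.5 s window and 0.5 s shift quoted above can be illustrated with a short sketch that lists the segment boundaries an embedding extractor would be run on. The function name `sliding_windows`, the use of integer milliseconds, and the choice to drop a trailing remainder shorter than one window are assumptions; the quoted statement does not say how the tail is handled.

```python
def sliding_windows(duration_ms, window_ms=1500, shift_ms=500):
    """Return (start_ms, end_ms) for every full window that fits in the audio.

    Integer milliseconds avoid floating-point drift when stepping by the shift.
    Assumption: a trailing remainder shorter than one window is dropped.
    """
    if window_ms <= 0 or shift_ms <= 0:
        raise ValueError("window_ms and shift_ms must be positive")
    last_start = duration_ms - window_ms
    return [(start, start + window_ms) for start in range(0, last_start + 1, shift_ms)]


# A 3-second recording yields four overlapping 1.5 s segments.
segments = sliding_windows(3000)
```

With these settings consecutive segments overlap by 1.0 s, so each instant of audio in the interior of a recording is covered by three segments.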

“…All three systems share the identical speaker embedding extractor, and two offline baselines share the clustering algorithm addressed in 2.2. The difference between offline-base and online is the clustering algorithm; offline-best refers to the system where we apply feature enhancement techniques proposed in [10].…”

Section: Offline Vs Online (mentioning)

confidence: 99%

“…Speaker diarisation (SD), which segments input audio to short utterances according to speaker identity, is going through a rapid breakthrough [1,2]. Based on the success of recent SD systems [3][4][5][6][7][8][9][10][11][12], online SD systems are also being developed [13][14][15][16][17][18][19][20]. In an online SD system, the system should decide the speaker label of a given short segment leveraging only current and past segments, where only a part of past segments are available.…”

Section: Introduction (mentioning)

confidence: 99%

Absolute decision corrupts absolutely: conservative online speaker diarisation

Kwon, Heo, Lee et al. 2022

Preprint

Our focus lies in developing an online speaker diarisation framework which demonstrates robust performance across diverse domains. In online speaker diarisation, outputs generated in real-time are irreversible, and a few misjudgements in the early phase of an input session can lead to catastrophic results. We hypothesise that cautiously increasing the number of estimated speakers is of paramount importance among many other factors. Thus, our proposed framework includes decreasing the number of speakers by one when the system judges that an increase in the past was faulty. We also adopt dual buffers, checkpoints and centroids, where checkpoints are combined with silhouette coefficients to estimate the number of speakers and centroids represent speakers. Again, we believe that more than one centroid can be generated from one speaker. Thus we design a clustering-based label matching technique to assign labels in realtime. The resulting system is lightweight yet surprisingly effective. The system demonstrates state-of-the-art performance on DIHARD II and III datasets, where it is also competitive in AMI and VoxConverse test sets.
