ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414315
Analysis of the BUT Diarization System for VoxConverse Challenge

Abstract: This paper describes the system developed by the BUT team for the fourth track of the VoxCeleb Speaker Recognition Challenge, focusing on diarization on the VoxConverse dataset. The system consists of signal pre-processing, voice activity detection, speaker embedding extraction, an initial agglomerative hierarchical clustering followed by diarization using a Bayesian hidden Markov model, a reclustering step based on per-speaker global embeddings, and overlapped speech detection and handling. We provide comparis…
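The abstract lists the pipeline stages in order. The sketch below is a minimal, hypothetical skeleton of that ordering (pre-processing, VAD, embedding extraction, AHC, Bayesian-HMM refinement, global reclustering, overlap handling); all function names, signatures, and docstrings are illustrative assumptions, not the BUT team's actual code.

```python
# Hypothetical skeleton of the diarization pipeline stages named in the abstract.
# Placeholder bodies only; nothing here reproduces the BUT implementation.
from typing import List, Tuple
import numpy as np

Segment = Tuple[float, float]   # (start, end) in seconds


def preprocess(wav: np.ndarray, sr: int) -> np.ndarray:
    """Signal pre-processing (e.g. denoising); identity placeholder here."""
    return wav


def voice_activity_detection(wav: np.ndarray, sr: int) -> List[Segment]:
    """Return detected speech regions; placeholder body."""
    ...


def extract_embeddings(wav: np.ndarray, sr: int, speech: List[Segment]) -> np.ndarray:
    """Per-window speaker embeddings (e.g. x-vector style); placeholder body."""
    ...


def ahc_initial_clustering(emb: np.ndarray) -> np.ndarray:
    """Initial agglomerative hierarchical clustering of the embeddings."""
    ...


def vb_hmm_resegmentation(emb: np.ndarray, init_labels: np.ndarray) -> np.ndarray:
    """Refine the AHC labels with a Bayesian hidden Markov model."""
    ...


def recluster_global(emb: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Re-cluster using one global (per-speaker mean) embedding per cluster."""
    ...


def handle_overlap(wav: np.ndarray, sr: int, labels: np.ndarray) -> np.ndarray:
    """Detect overlapped speech and assign a second speaker where needed."""
    ...
```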

Cited by 17 publications (8 citation statements)
References 18 publications
“…We tested the pipeline with VoxConverse corpus [23], which is an audio-visual diarization dataset consisting of over 50 hours of multi-speaker clips of human speech, extracted from videos collected on the internet. The DER achieved on VoxConverse using the BUT system is 4.41%, which is consistent with the result in [22].…”
Section: Data Pre-tagging: Speaker Segmentation (supporting)
confidence: 88%
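The statement above reports a diarization error rate (DER) of 4.41%. As a reminder of what that metric measures, here is a simplified frame-level DER sketch (missed speech + false alarm + speaker confusion over total reference speech), without the forgiveness collar or overlap handling of the official scoring tools such as md-eval; the synthetic frame labels in the usage example are invented for illustration.

```python
# Simplified frame-level DER; not the official md-eval / dscore implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment


def frame_der(ref: np.ndarray, hyp: np.ndarray) -> float:
    """ref/hyp: per-frame speaker ids, 0 = silence, >0 = speaker index."""
    speech_ref = ref > 0
    speech_hyp = hyp > 0
    missed = np.sum(speech_ref & ~speech_hyp)
    false_alarm = np.sum(~speech_ref & speech_hyp)

    # Optimal one-to-one mapping between reference and hypothesis speakers.
    ref_ids, hyp_ids = np.unique(ref[ref > 0]), np.unique(hyp[hyp > 0])
    overlap = np.array([[np.sum((ref == r) & (hyp == h)) for h in hyp_ids]
                        for r in ref_ids])
    ri, hi = linear_sum_assignment(-overlap)
    correct = overlap[ri, hi].sum()

    both_speech = np.sum(speech_ref & speech_hyp)
    confusion = both_speech - correct
    return (missed + false_alarm + confusion) / np.sum(speech_ref)


# Usage on tiny synthetic frame labels (values are illustrative only).
ref = np.array([0, 1, 1, 1, 2, 2, 0, 0, 1, 1])
hyp = np.array([0, 2, 2, 2, 1, 1, 1, 0, 2, 2])
print(f"DER = {frame_der(ref, hyp):.2%}")
```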
“…The BUT speaker diarization framework [22] is adopted in our data annotation pipeline for speaker segmentation and speaker clustering purposes. The speaker diarization framework generally involves an embedding stage followed by a clustering stage, which is illustrated in Fig.…”
Section: Data Pre-tagging: Speaker Segmentation (mentioning)
confidence: 99%
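The quoted passage describes the generic two-stage recipe: an embedding stage followed by a clustering stage. Below is an illustrative sketch of the clustering half, agglomerative hierarchical clustering of speaker embeddings with cosine distance; the synthetic embeddings and the 0.5 distance threshold are assumptions, not values from the cited systems.

```python
# AHC of speaker embeddings with cosine distance and average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Fake embeddings: 60 windows drawn around 3 synthetic speaker centroids.
centroids = rng.normal(size=(3, 32))
spk = rng.integers(0, 3, size=60)
emb = centroids[spk] + 0.2 * rng.normal(size=(60, 32))

# Average-linkage AHC on cosine distances; cut the dendrogram at a threshold.
Z = linkage(pdist(emb, metric="cosine"), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
print("estimated number of speakers:", len(np.unique(labels)))
```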
“…Notice that a higher threshold value in this work than those in previous works [5,10,23] caused slight underclustering. However, this underclustering was remedied using variational Bayesian (VB)-HMM-based clustering [24,25]. VB-HMM aims at reassigning a cluster index to each frame by considering the time dependencies with a proper number of clusters.…”
Section: Clustering (mentioning)
confidence: 99%
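The key idea in the passage above is that VB-HMM reassigns a speaker index to each frame while modelling time dependencies. As a much simplified stand-in for that idea (not VBx or the cited VB-HMM [24,25]), the sketch below runs Viterbi decoding over per-frame speaker scores with a "sticky" transition penalty so labels cannot flip frame by frame; the penalty value and the synthetic scores are assumptions.

```python
# Sticky Viterbi smoothing of per-frame speaker scores (toy analogue of
# HMM-based resegmentation, not the VB-HMM of the cited works).
import numpy as np


def sticky_viterbi(log_scores: np.ndarray, switch_penalty: float = 4.0) -> np.ndarray:
    """log_scores: (T, K) frame log-scores per speaker; returns (T,) labels."""
    T, K = log_scores.shape
    trans = -switch_penalty * (1.0 - np.eye(K))      # 0 to stay, -penalty to switch
    delta = log_scores[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans                # cand[i, j]: best score ending in j via i
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(K)] + log_scores[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                    # backtrack
        path[t - 1] = back[t, path[t]]
    return path


# Usage: noisy scores for 2 speakers; smoothing removes spurious label flips.
rng = np.random.default_rng(1)
true = np.repeat([0, 1, 0], [30, 40, 30])
scores = 1.5 * np.eye(2)[true] + rng.normal(size=(100, 2))
print(sticky_viterbi(scores))
```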
“…After that, GT-labels and GT-SAD were constructed from the ES2008a.[A-D].words.xml files generated for the AMI corpus for words only, using forced alignment and HTK [20] (conveniently already extracted in the "only words" directory of [12], [24]) (GT3), and lastly constructed from those same AMI corpus files but this time including non-word vocal sounds, conveniently in the "word and vocalsounds" directory of [12], [24] (GT4 and, together with GT1, GT2 and GT3, the GTs). References to GT-labels and GT-SAD generated from specific ground truths are GT1-labels and GT1-SAD, for example.…”
Section: A. Datasets and Systems Used (mentioning)
confidence: 99%
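The quoted work builds ground-truth labels and SAD from AMI word-level timing files such as ES2008a.A.words.xml. The sketch below shows one way such files could be turned into per-speaker speech segments; the element name "w" and the "starttime"/"endtime" attributes are assumptions about the AMI word annotation schema, and the merging gap is an arbitrary illustrative value.

```python
# Hypothetical conversion of word-level timing XML into speech segments.
import xml.etree.ElementTree as ET
from typing import List, Tuple


def words_to_segments(xml_path: str, gap: float = 0.5) -> List[Tuple[float, float]]:
    """Merge consecutive word timings into segments separated by >= `gap` seconds."""
    root = ET.parse(xml_path).getroot()
    words = []
    for el in root.iter():
        # Assumed schema: word elements tagged "w" with starttime/endtime attributes.
        if el.tag.endswith("w") and "starttime" in el.attrib and "endtime" in el.attrib:
            words.append((float(el.attrib["starttime"]), float(el.attrib["endtime"])))
    words.sort()
    segments: List[Tuple[float, float]] = []
    for start, end in words:
        if segments and start - segments[-1][1] < gap:
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments


# Hypothetical usage (path is illustrative):
# print(words_to_segments("ES2008a.A.words.xml")[:5])
```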