ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414315
Analysis of the BUT Diarization System for VoxConverse Challenge

Abstract: This paper describes the system developed by the BUT team for the fourth track of the VoxCeleb Speaker Recognition Challenge, focusing on diarization on the VoxConverse dataset. The system consists of signal pre-processing, voice activity detection, speaker embedding extraction, an initial agglomerative hierarchical clustering followed by diarization using a Bayesian hidden Markov model, a reclustering step based on per-speaker global embeddings, and overlapped speech detection and handling. We provide comparis…
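The abstract lists the pipeline stages in order. The sketch below is a minimal, hypothetical skeleton of that ordering (pre-processing, VAD, embedding extraction, AHC, Bayesian-HMM refinement, global reclustering, overlap handling); all function names, signatures, and docstrings are illustrative assumptions, not the BUT team's actual code.

```python
# Hypothetical skeleton of the diarization pipeline stages named in the abstract.
# Placeholder bodies only; nothing here reproduces the BUT implementation.
from typing import List, Tuple
import numpy as np

Segment = Tuple[float, float]   # (start, end) in seconds


def preprocess(wav: np.ndarray, sr: int) -> np.ndarray:
    """Signal pre-processing (e.g. denoising); identity placeholder here."""
    return wav


def voice_activity_detection(wav: np.ndarray, sr: int) -> List[Segment]:
    """Return detected speech regions; placeholder body."""
    ...


def extract_embeddings(wav: np.ndarray, sr: int, speech: List[Segment]) -> np.ndarray:
    """Per-window speaker embeddings (e.g. x-vector style); placeholder body."""
    ...


def ahc_initial_clustering(emb: np.ndarray) -> np.ndarray:
    """Initial agglomerative hierarchical clustering of the embeddings."""
    ...


def vb_hmm_resegmentation(emb: np.ndarray, init_labels: np.ndarray) -> np.ndarray:
    """Refine the AHC labels with a Bayesian hidden Markov model."""
    ...


def recluster_global(emb: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Re-cluster using one global (per-speaker mean) embedding per cluster."""
    ...


def handle_overlap(wav: np.ndarray, sr: int, labels: np.ndarray) -> np.ndarray:
    """Detect overlapped speech and assign a second speaker where needed."""
    ...
```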

Cited by 17 publications (8 citation statements)
References 18 publications
“…We tested the pipeline with VoxConverse corpus [23], which is an audio-visual diarization dataset consisting of over 50 hours of multi-speaker clips of human speech, extracted from videos collected on the internet. The DER achieved on VoxConverse using the BUT system is 4.41%, which is consistent with the result in [22].…”
Section: Data Pre-tagging: Speaker Segmentation (supporting)
confidence: 88%
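The statement above reports a diarization error rate (DER) of 4.41%. As a reminder of what that metric measures, here is a simplified frame-level DER sketch (missed speech + false alarm + speaker confusion over total reference speech), without the forgiveness collar or overlap handling of the official scoring tools such as md-eval; the synthetic frame labels in the usage example are invented for illustration.

```python
# Simplified frame-level DER; not the official md-eval / dscore implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment


def frame_der(ref: np.ndarray, hyp: np.ndarray) -> float:
    """ref/hyp: per-frame speaker ids, 0 = silence, >0 = speaker index."""
    speech_ref = ref > 0
    speech_hyp = hyp > 0
    missed = np.sum(speech_ref & ~speech_hyp)
    false_alarm = np.sum(~speech_ref & speech_hyp)

    # Optimal one-to-one mapping between reference and hypothesis speakers.
    ref_ids, hyp_ids = np.unique(ref[ref > 0]), np.unique(hyp[hyp > 0])
    overlap = np.array([[np.sum((ref == r) & (hyp == h)) for h in hyp_ids]
                        for r in ref_ids])
    ri, hi = linear_sum_assignment(-overlap)
    correct = overlap[ri, hi].sum()

    both_speech = np.sum(speech_ref & speech_hyp)
    confusion = both_speech - correct
    return (missed + false_alarm + confusion) / np.sum(speech_ref)


# Usage on tiny synthetic frame labels (values are illustrative only).
ref = np.array([0, 1, 1, 1, 2, 2, 0, 0, 1, 1])
hyp = np.array([0, 2, 2, 2, 1, 1, 1, 0, 2, 2])
print(f"DER = {frame_der(ref, hyp):.2%}")
```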
“…The BUT speaker diarization framework [22] is adopted in our data annotation pipeline for speaker segmentation and speaker clustering purposes. The speaker diarization framework generally involves an embedding stage followed by a clustering stage, which is illustrated in Fig.…”
Section: Data Pre-tagging: Speaker Segmentation (mentioning)
confidence: 99%
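The quoted passage describes the generic two-stage recipe: an embedding stage followed by a clustering stage. Below is an illustrative sketch of the clustering half, agglomerative hierarchical clustering of speaker embeddings with cosine distance; the synthetic embeddings and the 0.5 distance threshold are assumptions, not values from the cited systems.

```python
# AHC of speaker embeddings with cosine distance and average linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Fake embeddings: 60 windows drawn around 3 synthetic speaker centroids.
centroids = rng.normal(size=(3, 32))
spk = rng.integers(0, 3, size=60)
emb = centroids[spk] + 0.2 * rng.normal(size=(60, 32))

# Average-linkage AHC on cosine distances; cut the dendrogram at a threshold.
Z = linkage(pdist(emb, metric="cosine"), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
print("estimated number of speakers:", len(np.unique(labels)))
```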
“…Notice that a higher threshold value in this work than those in previous works [5,10,23] caused slight underclustering. However, this underclustering was remedied using variational Bayesian (VB)-HMM-based clustering [24,25]. VB-HMM aims at reassigning a cluster index to each frame by considering the time dependencies with a proper number of clusters.…”
Section: Clustering (mentioning)
confidence: 99%
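The key idea in the passage above is that VB-HMM reassigns a speaker index to each frame while modelling time dependencies. As a much simplified stand-in for that idea (not VBx or the cited VB-HMM [24,25]), the sketch below runs Viterbi decoding over per-frame speaker scores with a "sticky" transition penalty so labels cannot flip frame by frame; the penalty value and the synthetic scores are assumptions.

```python
# Sticky Viterbi smoothing of per-frame speaker scores (toy analogue of
# HMM-based resegmentation, not the VB-HMM of the cited works).
import numpy as np


def sticky_viterbi(log_scores: np.ndarray, switch_penalty: float = 4.0) -> np.ndarray:
    """log_scores: (T, K) frame log-scores per speaker; returns (T,) labels."""
    T, K = log_scores.shape
    trans = -switch_penalty * (1.0 - np.eye(K))      # 0 to stay, -penalty to switch
    delta = log_scores[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans                # cand[i, j]: best score ending in j via i
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(K)] + log_scores[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):                    # backtrack
        path[t - 1] = back[t, path[t]]
    return path


# Usage: noisy scores for 2 speakers; smoothing removes spurious label flips.
rng = np.random.default_rng(1)
true = np.repeat([0, 1, 0], [30, 40, 30])
scores = 1.5 * np.eye(2)[true] + rng.normal(size=(100, 2))
print(sticky_viterbi(scores))
```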
“…After that, GT-labels and GT-SAD were constructed from the ES2008a.[A-D].words.xml files generated for the AMI corpus for words only, using forced alignment and HTK [20] (conveniently already extracted in the "only words" directory of [12], [24]) (GT3), and lastly constructed from those same AMI corpus files but this time including non-word vocal sounds, conveniently in the "word and vocalsounds" directory of [12], [24] (GT4 and, together with GT1, GT2 and GT3, the GTs). References to GT-labels and GT-SAD generated from specific ground truths are GT1-labels and GT1-SAD, for example.…”
Section: A. Datasets and Systems Used (mentioning)
confidence: 99%
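The quoted work builds ground-truth labels and SAD from AMI word-level timing files such as ES2008a.A.words.xml. The sketch below shows one way such files could be turned into per-speaker speech segments; the element name "w" and the "starttime"/"endtime" attributes are assumptions about the AMI word annotation schema, and the merging gap is an arbitrary illustrative value.

```python
# Hypothetical conversion of word-level timing XML into speech segments.
import xml.etree.ElementTree as ET
from typing import List, Tuple


def words_to_segments(xml_path: str, gap: float = 0.5) -> List[Tuple[float, float]]:
    """Merge consecutive word timings into segments separated by >= `gap` seconds."""
    root = ET.parse(xml_path).getroot()
    words = []
    for el in root.iter():
        # Assumed schema: word elements tagged "w" with starttime/endtime attributes.
        if el.tag.endswith("w") and "starttime" in el.attrib and "endtime" in el.attrib:
            words.append((float(el.attrib["starttime"]), float(el.attrib["endtime"])))
    words.sort()
    segments: List[Tuple[float, float]] = []
    for start, end in words:
        if segments and start - segments[-1][1] < gap:
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments


# Hypothetical usage (path is illustrative):
# print(words_to_segments("ES2008a.A.words.xml")[:5])
```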