BUT System for the Second DIHARD Speech Diarization Challenge

Landini, Federico; Wang, Shuai; Díez, Mireia; Burget, Lukáš; Matějka, Pavel; Žmolíková, Kateřina; Mošner, Ladislav; Silnova, Anna; Plchot, Oldřich; Novotny, Ondrej; Zeinali, Hossein; Rohdin, Johan

doi:10.1109/icassp40776.2020.9054251

Cited by 46 publications

(53 citation statements)

References 11 publications

“…For comparison, we also report the performance of several existing DIHARD challenge II submissions. The challenge top system by BUT achieves a DER value of 18.09% on the DIHARD II dev set [60]. However, it is mentioned in the paper that in their system, PLDA was adapted on the same development set.…”

Section: Analysis of Experimental Results (mentioning)

confidence: 99%

Speaker Diarization Using Latent Space Clustering in Generative Adversarial Network

Pal, Kumar, Peri, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network. The proposed diarization system is trained jointly with GAN loss, latent variable recovery loss, and a clustering-specific loss. It uses x-vector speaker embeddings at the input, while the latent variables are sampled from a combination of continuous random variables and discrete one-hot encoded variables using the original speaker labels. We benchmark our proposed system on the AMI meeting corpus, and two child-clinician interaction corpora (ADOS and BOSCC) from the autism diagnosis domain. ADOS and BOSCC contain diagnostic and treatment outcome sessions, respectively, obtained in clinical settings for verbal children and adolescents with autism. Experimental results show that our proposed system significantly outperforms the state-of-the-art x-vector based diarization system on these databases. Further, we perform embedding fusion with x-vectors to achieve a relative DER improvement of 31%, 36% and 49% on AMI eval, ADOS and BOSCC corpora, respectively, when compared to the x-vector baseline using oracle speech segmentation.
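
The abstract above reports relative DER improvements. As a reminder of the arithmetic, here is a minimal Python sketch of the usual definition of diarization error rate (missed speech, false alarm and speaker confusion time over total reference speech time) and of a relative improvement. The function names and all durations are hypothetical, chosen only to illustrate the calculation; they are not taken from any of the cited papers.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """DER: summed error durations divided by total reference speech time."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def relative_improvement(baseline_der, system_der):
    """Fraction by which system_der improves on baseline_der."""
    return (baseline_der - system_der) / baseline_der


# Hypothetical durations in seconds (illustration only, not real results).
baseline = diarization_error_rate(
    missed=30.0, false_alarm=20.0, confusion=50.0, total_speech=1000.0
)
improved = diarization_error_rate(
    missed=30.0, false_alarm=20.0, confusion=19.0, total_speech=1000.0
)
print(f"baseline DER = {baseline:.1%}, improved DER = {improved:.1%}")
print(f"relative improvement = {relative_improvement(baseline, improved):.0%}")
```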

“…We considered two methods for signal preprocessing: the speech enhancement method based on a long short-term memory (LSTM) network trained on simulated data [9] (also used in the baseline) and the weighted prediction error (WPE) [10,11] as it had proved to be useful in the Second DIHARD challenge [12]. In our experiments, we saw that using the LSTM-based speech enhancer was beneficial while the WPE method was actually harmful.…”

Section: Signal Pre-processing (mentioning)

confidence: 99%

“…• a deep neural network (DNN) based system with three feedforward layers receiving as input ±5 stacked frames and trained to output 10ms frame decisions (silence / speech) [12]. It was trained on part of the second DIHARD development set (the rest was used for validation while training), the train set of the "fullcorpus" partition of AMI 2 [13] (the test and development sets were used for validation while training), ICSI [14] and ISL [15] meetings.…”

Section: Voice Activity Detection (mentioning)

confidence: 99%
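
The statement above describes a DNN-based detector that receives ±5 stacked frames as input. The following is a minimal sketch of that kind of context stacking with NumPy. The function name `stack_context`, the toy feature matrix, and the edge handling (repeating the first and last frame) are assumptions for illustration; the quoted text does not say how the edges of a recording are treated.

```python
import numpy as np


def stack_context(features, context=5):
    """Stack each frame with `context` neighbours on each side.

    features: array of shape (num_frames, feat_dim)
    returns:  array of shape (num_frames, (2 * context + 1) * feat_dim)

    Assumption: edges are padded by repeating the first/last frame.
    """
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    num_frames = features.shape[0]
    # Shifted views of the padded sequence: offset -context ... +context.
    windows = [padded[i:i + num_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)


# Toy input: 10 frames of 2-dimensional features (hypothetical values).
feats = np.arange(20, dtype=float).reshape(10, 2)
stacked = stack_context(feats, context=5)
print(stacked.shape)
```

Each row of `stacked` would then be the input for one frame-level speech/silence decision; the centre block of each row is the original frame itself.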

“…However, the model benefits from using a more sensible initial assignment. As in previous work [12], the x-vectors extracted from an input recording are clustered by means of agglomerative hierarchical clustering (AHC) with similarity metric based on probabilistic linear discriminant analysis (PLDA) [21] log-likelihood ratio scores, as used for speaker verification. The PLDA model for this purpose was trained on x-vectors extracted from concatenated speech segments from VoxCeleb 2 [6] which are mean-centered, whitened to have identity covariance matrix and length-normalized [22].…”

Section: Initial Clustering (mentioning)

confidence: 99%
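
The statement above describes mean-centering, whitening and length-normalizing x-vectors before agglomerative hierarchical clustering. The sketch below illustrates those steps in Python with NumPy and SciPy under loud assumptions: cosine distance with average linkage is swapped in for the PLDA log-likelihood ratio scores that the cited system uses, the "training" and "recording" embeddings are random toy data, and the helper names (`fit_whitening`, `preprocess`, `ahc_cosine`) and the threshold are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def fit_whitening(train):
    """Estimate the mean and a whitening transform from training embeddings."""
    mean = train.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(train - mean, rowvar=False))
    # Scale each eigenvector so the transformed data has identity covariance.
    transform = eigvecs / np.sqrt(np.maximum(eigvals, 1e-10))
    return mean, transform


def preprocess(embeddings, mean, transform):
    """Mean-center, whiten, then length-normalize each embedding."""
    whitened = (embeddings - mean) @ transform
    return whitened / np.linalg.norm(whitened, axis=1, keepdims=True)


def ahc_cosine(embeddings, threshold):
    """AHC with average linkage on cosine distance (a stand-in for PLDA scores)."""
    tree = linkage(embeddings, method="average", metric="cosine")
    return fcluster(tree, t=threshold, criterion="distance")


rng = np.random.default_rng(0)
# Hypothetical stand-in for the embeddings used to train the backend.
train = rng.normal(size=(500, 4))
mean, transform = fit_whitening(train)

# Toy "recording": two speakers, ten embeddings each.
speaker_means = np.array([[3.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0]])
true_speakers = np.repeat([0, 1], 10)
recording = speaker_means[true_speakers] + 0.1 * rng.normal(size=(20, 4))

normalized = preprocess(recording, mean, transform)
labels = ahc_cosine(normalized, threshold=0.5)
print(labels)
```

In the cited pipeline this clustering only provides an initial assignment that a later stage refines, so the stopping threshold here is a tunable choice rather than a final decision.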

Analysis of the BUT Diarization System for VoxConverse Challenge

Landini, Glembek, Matějka, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

This paper describes the system developed by the BUT team for the fourth track of the VoxCeleb Speaker Recognition Challenge, focusing on diarization on the VoxConverse dataset. The system consists of signal pre-processing, voice activity detection, speaker embedding extraction, an initial agglomerative hierarchical clustering followed by diarization using a Bayesian hidden Markov model, a reclustering step based on per-speaker global embeddings and overlapped speech detection and handling. We provide comparisons for each of the steps and share the implementation of the most relevant modules of our system. Our system scored second in the challenge in terms of the primary metric (diarization error rate) and first according to the secondary metric (Jaccard error rate).

“…These are then refined using a separate HMM. This first-pass AHC and second-pass HMM approach has proven to be effective on challenging diarisation tasks [14], and this is the approach that is adopted in this report.…”

Section: Introduction (mentioning)

confidence: 99%

Hidden Markov Model Diarisation with Speaker Location Information

Wong, Xiao, Gong 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Speaker diarisation methods often rely on speaker embeddings to cluster together the segments of audio that are uttered by the same speaker. When the audio is captured using a microphone array, it is possible to estimate the locations from which the sounds originate. This location information may be complementary to the speaker embeddings in the diarisation processes. This report proposes to extend the Hidden Markov Model (HMM) clustering method to enable the use of speaker location information. The HMM observation log-likelihood for the speaker location can take the form of a KL-divergence when the speaker location is represented as a discrete posterior distribution of the probabilities that the sound originated from each possible location. Experimental results on a Microsoft rich meeting transcription task show that using speaker location information with the proposed HMM modification can yield performance improvements over using speaker embeddings alone.
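
The abstract above states that the location observation log-likelihood can take the form of a KL-divergence over discrete location posteriors. The Python sketch below shows one way such a score could look. The exact form is an assumption: here each speaker state scores an observation as the negative, optionally scaled, KL-divergence from the observed posterior to that state's location distribution. The direction of the divergence, the `scale` parameter and all names and numbers are hypothetical and are not taken from the paper.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with clipping to avoid log(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def location_log_likelihood(observed_posterior, state_location_dists, scale=1.0):
    """Assumed form: score of each speaker state = -scale * KL(observed || state)."""
    return np.array(
        [-scale * kl_divergence(observed_posterior, q) for q in state_location_dists]
    )


# Hypothetical example with three candidate locations and two speaker states.
observed = [0.8, 0.1, 0.1]
states = [
    [0.7, 0.2, 0.1],  # speaker A is usually at location 0
    [0.1, 0.1, 0.8],  # speaker B is usually at location 2
]
scores = location_log_likelihood(observed, states)
print(scores, "best state:", int(np.argmax(scores)))
```

In an HMM that also uses speaker embeddings, a score of this kind would typically be added to the embedding log-likelihood of each state before decoding.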
