A Comparative Study of Bottom-Up and Top-Down Approaches to Speaker Diarization

Evans, Nicholas; Bozonnet, Simon; Wang, Dong; Fredouille, Corinne; Troncy, Raphaël

doi:10.1109/tasl.2011.2159710

Cited by 37 publications

(27 citation statements)

References 26 publications

(40 reference statements)


“…The top-down approach is reported to give worse performance on the NIST RT database [25] and has thus received less attention. However, paper [40] makes a thorough comparative study of these two approaches and demonstrates that these two approaches have similar performance.…”

Section: Top-down Approach (mentioning)

confidence: 99%

“…Although random initialization works well in most cases, LCM and VB systems tend to assign the segments to each speaker evenly in the case where a single speaker dominates the whole conversation, leading to poor results. According to the comparative study [40], we know that the bottom-up approach will capture comparatively purer models. Therefore, we recommend an informative AHC initialization method, similar to our previous paper [51].…”

Section: AHC Initialization (mentioning)

confidence: 99%


Latent class model with application to speaker diarization

Chen et al. 2019

J AUDIO SPEECH MUSIC PROC.


In this paper, we apply a latent class model (LCM) to the task of speaker diarization. LCM is similar to Patrick Kenny's variational Bayes (VB) method in that it uses soft information and avoids premature hard decisions in its iterations. In contrast to the VB method, which is based on a generative model, LCM provides a framework allowing both generative and discriminative models. The discriminative property is realized through the use of i-vector (Ivec), probabilistic linear discriminative analysis (PLDA), and a support vector machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid are introduced. In addition, three further improvements are applied to enhance its performance. 1) Adding neighbor windows to extract more speaker information for each short segment. 2) Using a hidden Markov model to avoid frequent speaker change points. 3) Using an agglomerative hierarchical cluster to do initialization and present hard and soft priors, in order to overcome the problem of initial sensitivity. Experiments on the National Institute of Standards and Technology Rich Transcription 2009 speaker diarization database, under the condition of a single distant microphone, show that the diarization error rate (DER) of the proposed methods has substantial relative improvements compared with mainstream systems. Compared to the VB method, the relative improvements of LCM-Ivec-PLDA, LCM-Ivec-SVM, and LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments on our collected database, CALLHOME97, CALLHOME00 and SRE08 short2-summed trial conditions also show that the proposed LCM-Ivec-Hybrid system has the best overall performance.


“…Speaker diarization [1][2][3] is an unsupervised statistical pattern recognition task which aims to determine 'who spoke when' in a given audio stream. Speaker diarization has become a key, enabling technology in a wide variety of tasks including document processing, structuring and navigation, information retrieval, meta-data extraction and copyright detection.…”

Section: Introduction (mentioning)

confidence: 99%

“…Historically, the state of the art in speaker diarization for meetings has evolved around the implementation of offline systems, such as bottom-up and top-down hierarchical clustering approaches [3][4][5]. In both cases, speakers are modelled with Gaussian mixture models (GMMs) which are interconnected to form an ergodic hidden Markov model (HMM) in which the transitions represent speaker turns.…”

Section: Introduction (mentioning)

confidence: 99%

Adaptive and online speaker diarization for meeting data

Soldi, Beaugeant, Evans 2015

2015 23rd European Signal Processing Conference (EUSIPCO)

Self Cite


Speaker diarization aims to determine 'who spoke when' in a given audio stream. Different applications, such as document structuring or information retrieval have led to the exploration of speaker diarization in many different domains, from broadcast news to lectures, phone conversations and meetings. Almost all current diarization systems are offline and ill-suited to the growing need for online or real-time diarization, stemming from the increasing popularity of powerful, mobile smart devices. While a small number of such systems have been reported, truly online diarization systems for challenging and highly spontaneous meeting data are lacking. This paper reports our work to develop an adaptive and online diarization system using the NIST Rich Transcription meetings corpora. While not dissimilar to those previously reported for less challenging domains, high diarization error rates illustrate the challenge ahead and lead to some ideas to improve performance through future research.


“…As for speaker diarization, many research works are based on agglomerative and divisive hierarchical manner such as top-down or bottom-up algorithms [2]. The bottom-up approach is by far the most popular system, that is, hierarchical agglomerative clustering (HAC).…”

Section: Introduction (mentioning)

confidence: 99%

Variational Bayes based I-vector for speaker diarization of telephone conversations

Zheng, Zhang, et al. 2014

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


In this paper, we investigate the variational Bayes based I-vector method for speaker diarization of telephone conversations. The motivation of the proposed algorithm is to utilize variational Bayesian framework and exploit potential channel effect of total variability modeling for diarization of conversation side. Other three well-known techniques are compared as follows: K-means clustering for eigenvoices and I-vector speaker diarization, and variational Bayes applied to eigenvoices. Performance evaluations are conducted on the summed-channel telephone data from the 2008 NIST speaker recognition evaluation. The paper discusses how the performance is influenced by different modules, e.g., VAD, initial speaker clustering and Viterbi re-segmentation. Comparison experiments show the interest of variational Bayesian probabilistic framework for speaker diarization.
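Several of the citation statements above refer to the bottom-up (agglomerative) strategy, in which every segment starts as its own cluster and the closest clusters are merged until a stopping criterion is met. The following is only a toy sketch of that idea, not the method of any paper listed here: it assumes each segment is already summarized by a small feature vector, uses Euclidean distance between cluster centroids in place of the GMM-based merge criteria used in real systems, and the names `bottom_up_cluster`, `toy_segments`, and `stop_distance` are hypothetical.

```python
from itertools import combinations
import math


def centroid(points):
    """Mean vector of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def bottom_up_cluster(segments, stop_distance):
    """Greedy agglomerative clustering over segment feature vectors.

    Start with one cluster per segment and repeatedly merge the two
    clusters whose centroids are closest, stopping once the closest
    pair is farther apart than stop_distance (an assumed, hand-set
    threshold standing in for a real stopping criterion).
    Returns a list of clusters, each a sorted list of segment indices.
    """
    clusters = [[i] for i in range(len(segments))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            ca = centroid([segments[i] for i in clusters[a]])
            cb = centroid([segments[i] for i in clusters[b]])
            d = math.dist(ca, cb)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > stop_distance:
            break
        # a < b always holds here, so deleting index b leaves a valid.
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy 2-D "segment embeddings": two well-separated groups.
toy_segments = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (0.1, 0.2)]
speaker_clusters = bottom_up_cluster(toy_segments, stop_distance=1.0)
print(speaker_clusters)
```

With this toy input, segments 0, 1, and 4 end up in one cluster and segments 2 and 3 in another. A top-down system would work in the opposite direction, starting from a single model for all the audio and splitting off new speakers one at a time.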
