ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053952
Speaker Diarization Using Latent Space Clustering in Generative Adversarial Network

Abstract: In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network. The proposed diarization system is trained jointly with GAN loss, latent variable recovery loss, and a clustering-specific loss. It uses x-vector speaker embeddings at the input, while the latent variables are sampled from a combination of continuous random variables and discrete one-hot encoded variables using the original speaker labels. …
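As a rough illustration of the objective described in the abstract, the sketch below is a minimal PyTorch rendering, not the authors' code: the dimensions, network shapes, equal loss weights, and the use of cross-entropy as the clustering-specific term are all assumptions. A generator maps a latent code (continuous noise concatenated with a one-hot speaker label) to x-vector space, an encoder back-projects x-vectors into the latent space, and the GAN, latent recovery, and clustering losses are combined.

```python
# Minimal sketch of the joint objective (assumed shapes and loss weights).
import torch
import torch.nn as nn
import torch.nn.functional as F

XVEC_DIM, CONT_DIM, N_SPK = 512, 100, 8        # assumed dimensions
LATENT_DIM = CONT_DIM + N_SPK

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

G = mlp(LATENT_DIM, XVEC_DIM)   # latent code -> generated x-vector
D = mlp(XVEC_DIM, 1)            # real/fake score
E = mlp(XVEC_DIM, LATENT_DIM)   # x-vector -> recovered latent code

def sample_latent(batch, labels=None):
    """Continuous noise concatenated with a one-hot speaker code."""
    z_cont = torch.randn(batch, CONT_DIM)
    if labels is None:
        labels = torch.randint(0, N_SPK, (batch,))
    z_disc = F.one_hot(labels, N_SPK).float()
    return torch.cat([z_cont, z_disc], dim=1), labels

def training_step(real_xvec, spk_labels, opt_d, opt_ge):
    z, labels = sample_latent(real_xvec.size(0), spk_labels)
    fake_xvec = G(z)

    # Discriminator update: non-saturating GAN loss on real vs. generated x-vectors.
    d_real, d_fake = D(real_xvec), D(fake_xvec.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator + encoder update: GAN loss + latent recovery loss + clustering loss.
    loss_gan = F.binary_cross_entropy_with_logits(D(fake_xvec), torch.ones_like(d_fake))
    z_hat = E(fake_xvec)                                   # back-projection into latent space
    loss_recovery = F.mse_loss(z_hat[:, :CONT_DIM], z[:, :CONT_DIM])
    loss_cluster = F.cross_entropy(z_hat[:, CONT_DIM:], labels)  # one-hot speaker part
    loss_ge = loss_gan + loss_recovery + loss_cluster
    opt_ge.zero_grad(); loss_ge.backward(); opt_ge.step()
    return loss_d.item(), loss_ge.item()
```

In this sketch, `opt_ge` is assumed to optimize the generator and encoder jointly, e.g. `torch.optim.Adam(list(G.parameters()) + list(E.parameters()))`; at diarization time, test x-vectors would be passed through `E` and the recovered latent codes clustered.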

Cited by 15 publications (12 citation statements)
References 47 publications
“…We show the DER values of the PLDA+AHC approach for three different segment scales (1.5, 1.0, and 0.5 s) and how the performance of the diarization changes with the distance measure and clustering method. We also list the lowest DER value that we could find that has appeared in a published paper on speaker diarization [20,24], including the CHAES-eval results of our previous study [18].…”
Section: Discussion (confidence: 99%)
“…CHAES (LDC97S42) is a corpus that contains only English speech data. CHAES is divided into train (80), dev (20), and eval (20) splits.…”
Section: CallHome American English Speech (CHAES) (confidence: 99%)
“…The channels in the microphone array are beamformed with the standard BeamformIt toolkit [27]. The same split is used in many other works [6, 8, 28-30].…”
Section: Datasets (confidence: 99%)
“…Separately, the application of end-to-end modeling for two speaker conversational data has been explored in [19]. In the end-to-end learning, the input features are fed to a model where the loss is either permutation-invariant cross entropy [39], [40] or clustering based [41]. Further to refine the boundaries of segmentation output in speaker diarization, a second re-segmentation step involving frame-level (20-30ms) modeling [26], [27] can be performed.…”
Section: Related Work (confidence: 99%)
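The permutation-invariant cross-entropy mentioned in the last excerpt can be sketched for the two-speaker case as follows; this is a minimal illustration under assumed tensor shapes, not code from the cited works. The frame-level speech-activity loss is computed for every speaker-to-output assignment and the smallest value is kept for each utterance.

```python
# Minimal sketch of permutation-invariant binary cross-entropy (assumed shapes).
import itertools
import torch
import torch.nn.functional as F

def pit_bce_loss(logits, labels):
    """logits, labels: (batch, frames, 2) -- per-frame activity of two speakers."""
    losses = []
    for perm in itertools.permutations(range(labels.size(-1))):
        permuted = labels[..., list(perm)]                       # reorder speaker columns
        per_utt = F.binary_cross_entropy_with_logits(
            logits, permuted, reduction='none').mean(dim=(1, 2))  # (batch,)
        losses.append(per_utt)
    # Pick the best permutation independently for each utterance, then average.
    return torch.stack(losses, dim=0).min(dim=0).values.mean()
```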