ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053952
Speaker Diarization Using Latent Space Clustering in Generative Adversarial Network

Abstract: In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network. The proposed diarization system is trained jointly with GAN loss, latent variable recovery loss, and a clustering-specific loss. It uses x-vector speaker embeddings at the input, while the latent variables are sampled from a combination of continuous random variables and discrete one-hot encoded variables using the original speaker labels. …
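As a rough illustration of the objective described in the abstract, the sketch below is a minimal PyTorch rendering, not the authors' code: the dimensions, network shapes, equal loss weights, and the use of cross-entropy as the clustering-specific term are all assumptions. A generator maps a latent code (continuous noise concatenated with a one-hot speaker label) to x-vector space, an encoder back-projects x-vectors into the latent space, and the GAN, latent recovery, and clustering losses are combined.

```python
# Minimal sketch of the joint objective (assumed shapes and loss weights).
import torch
import torch.nn as nn
import torch.nn.functional as F

XVEC_DIM, CONT_DIM, N_SPK = 512, 100, 8        # assumed dimensions
LATENT_DIM = CONT_DIM + N_SPK

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

G = mlp(LATENT_DIM, XVEC_DIM)   # latent code -> generated x-vector
D = mlp(XVEC_DIM, 1)            # real/fake score
E = mlp(XVEC_DIM, LATENT_DIM)   # x-vector -> recovered latent code

def sample_latent(batch, labels=None):
    """Continuous noise concatenated with a one-hot speaker code."""
    z_cont = torch.randn(batch, CONT_DIM)
    if labels is None:
        labels = torch.randint(0, N_SPK, (batch,))
    z_disc = F.one_hot(labels, N_SPK).float()
    return torch.cat([z_cont, z_disc], dim=1), labels

def training_step(real_xvec, spk_labels, opt_d, opt_ge):
    z, labels = sample_latent(real_xvec.size(0), spk_labels)
    fake_xvec = G(z)

    # Discriminator update: non-saturating GAN loss on real vs. generated x-vectors.
    d_real, d_fake = D(real_xvec), D(fake_xvec.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator + encoder update: GAN loss + latent recovery loss + clustering loss.
    loss_gan = F.binary_cross_entropy_with_logits(D(fake_xvec), torch.ones_like(d_fake))
    z_hat = E(fake_xvec)                                   # back-projection into latent space
    loss_recovery = F.mse_loss(z_hat[:, :CONT_DIM], z[:, :CONT_DIM])
    loss_cluster = F.cross_entropy(z_hat[:, CONT_DIM:], labels)  # one-hot speaker part
    loss_ge = loss_gan + loss_recovery + loss_cluster
    opt_ge.zero_grad(); loss_ge.backward(); opt_ge.step()
    return loss_d.item(), loss_ge.item()
```

In this sketch, `opt_ge` is assumed to optimize the generator and encoder jointly, e.g. `torch.optim.Adam(list(G.parameters()) + list(E.parameters()))`; at diarization time, test x-vectors would be passed through `E` and the recovered latent codes clustered.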

Cited by 15 publications (12 citation statements)
References 47 publications
“…We show the DER values of the PLDA+AHC approach for three different segment scales (1.5, 1.0, and 0.5 s) and how the performance of the diarization changes with the distance measure and clustering method. We also list the lowest DER value that we could find that has appeared in a published paper on speaker diarization [20,24], including the CHAES-eval results of our previous study [18].…”
Section: Discussion (confidence: 99%)
“…CHAES (LDC97S42) is a corpus that contains only English speech data. CHAES is divided into train (80), dev (20), and eval (20) splits.…”
Section: CallHome American English Speech (CHAES) (confidence: 99%)
“…The channels in the microphone array are beamformed with the standard BeamformIt toolkit [27]. The same split is used in many other works [6, 8, 28-30].…”
Section: Datasets (confidence: 99%)
“…Separately, the application of end-to-end modeling for two speaker conversational data has been explored in [19]. In the end-to-end learning, the input features are fed to a model where the loss is either permutation-invariant cross entropy [39], [40] or clustering based [41]. Further to refine the boundaries of segmentation output in speaker diarization, a second re-segmentation step involving frame-level (20-30ms) modeling [26], [27] can be performed.…”
Section: Related Work (confidence: 99%)
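The permutation-invariant cross-entropy mentioned in the last excerpt can be sketched for the two-speaker case as follows; this is a minimal illustration under assumed tensor shapes, not code from the cited works. The frame-level speech-activity loss is computed for every speaker-to-output assignment and the smallest value is kept for each utterance.

```python
# Minimal sketch of permutation-invariant binary cross-entropy (assumed shapes).
import itertools
import torch
import torch.nn.functional as F

def pit_bce_loss(logits, labels):
    """logits, labels: (batch, frames, 2) -- per-frame activity of two speakers."""
    losses = []
    for perm in itertools.permutations(range(labels.size(-1))):
        permuted = labels[..., list(perm)]                       # reorder speaker columns
        per_utt = F.binary_cross_entropy_with_logits(
            logits, permuted, reduction='none').mean(dim=(1, 2))  # (batch,)
        losses.append(per_utt)
    # Pick the best permutation independently for each utterance, then average.
    return torch.stack(losses, dim=0).min(dim=0).values.mean()
```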