ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414713

An Iterative Framework for Self-Supervised Deep Speaker Representation Learning

Abstract: In this paper, we propose an iterative framework for self-supervised speaker representation learning based on a deep neural network (DNN). The framework starts with training a self-supervised speaker embedding network by maximizing agreement between different segments within an utterance via a contrastive loss. Taking advantage of the DNN's ability to learn from data with label noise, we propose to cluster the speaker embeddings obtained from the previous speaker network and use the subsequent class assignments as…
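The abstract describes a two-stage loop: a contrastive self-supervised bootstrap, followed by repeated rounds of clustering embeddings and retraining on the resulting pseudo-labels. As a rough, non-authoritative illustration of that loop (the paper's actual encoder architecture, clustering algorithm, and training procedures are not given here), a minimal PyTorch-style sketch might look like the following; `train_contrastive` and `train_supervised` are hypothetical helpers, and plain k-means via scikit-learn stands in for whatever clustering the authors actually use.

```python
# Hedged sketch of the iterative framework described in the abstract.
# `train_contrastive` and `train_supervised` are hypothetical helpers;
# the paper's real architecture and hyperparameters are not specified here.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def contrastive_loss(z1, z2, temperature=0.1):
    """NT-Xent-style loss: two segments of the same utterance form a
    positive pair; segments from other utterances in the batch act as
    negatives. This maximizes within-utterance agreement, as the
    abstract describes."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def iterative_training(encoder, utterances, n_clusters, n_rounds):
    # Stage 1: self-supervised bootstrap with the contrastive objective.
    train_contrastive(encoder, utterances, contrastive_loss)  # hypothetical
    for _ in range(n_rounds):
        # Stage 2: cluster current embeddings to get pseudo speaker labels.
        with torch.no_grad():
            embs = torch.cat([encoder(u) for u in utterances]).cpu().numpy()
        pseudo_labels = KMeans(n_clusters=n_clusters).fit_predict(embs)
        # Stage 3: retrain the encoder as a classifier on the (noisy)
        # pseudo-labels, then re-cluster with the improved model.
        encoder = train_supervised(encoder, utterances, pseudo_labels)  # hypothetical
    return encoder
```

The key design point the citation statements below pick up on is the re-clustering step: each round's pseudo-labels come from the previous round's embeddings, so label quality and embedding quality can improve together.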

Cited by 25 publications (24 citation statements)
References 17 publications (26 reference statements)
“…Relation with iterative clustering. Iterative clustering has been proven to be effective for self-supervised speaker verification models [30,47]. Our model can be viewed as the initial model of iterative clustering; thus, the model can enjoy the benefits of iterative clustering methods.…”
Section: Self-supervised Learning (mentioning)
confidence: 99%
“…The clustering is fixed during training, i.e. we do not re-cluster the chunks while training the extractor, as some self-supervised algorithms do [9,10].…”
Section: Baseline Speaker Diarization (mentioning)
confidence: 99%
“…The emergence of self-supervision methods in deep learning has also been applied to training speaker embedding extractors [8,9,10,11,12]. Several approaches have been examined, some of which employ an audiovisual setting [13,14].…”
Section: Introduction (mentioning)
confidence: 99%
“…Most recently, Kahn et al [22] investigated end-to-end ASR with pseudo-labeling, and Xu et al [34] proposed iterative pseudo-labeling as an extension of it. For speaker recognition, Cai et al [35] proposed an iterative framework with pseudo-labeling to train a speaker embedding network. While these studies focus on utilizing a large amount of unlabeled data, we aim to adapt the model to a target condition using unlabeled data.…”
Section: Related Work (mentioning)
confidence: 99%