ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414713

An Iterative Framework for Self-Supervised Deep Speaker Representation Learning

Abstract: In this paper, we propose an iterative framework for self-supervised speaker representation learning based on a deep neural network (DNN). The framework starts with training a self-supervised speaker embedding network by maximizing agreement between different segments within an utterance via a contrastive loss. Taking advantage of the DNN's ability to learn from data with label noise, we propose to cluster the speaker embeddings obtained from the previous speaker network and use the subsequent class assignments as…
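The abstract describes a two-stage loop: a contrastive self-supervised bootstrap, followed by repeated rounds of clustering embeddings and retraining on the resulting pseudo-labels. As a rough, non-authoritative illustration of that loop (the paper's actual encoder architecture, clustering algorithm, and training procedures are not given here), a minimal PyTorch-style sketch might look like the following; `train_contrastive` and `train_supervised` are hypothetical helpers, and plain k-means via scikit-learn stands in for whatever clustering the authors actually use.

```python
# Hedged sketch of the iterative framework described in the abstract.
# `train_contrastive` and `train_supervised` are hypothetical helpers;
# the paper's real architecture and hyperparameters are not specified here.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def contrastive_loss(z1, z2, temperature=0.1):
    """NT-Xent-style loss: two segments of the same utterance form a
    positive pair; segments from other utterances in the batch act as
    negatives. This maximizes within-utterance agreement, as the
    abstract describes."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def iterative_training(encoder, utterances, n_clusters, n_rounds):
    # Stage 1: self-supervised bootstrap with the contrastive objective.
    train_contrastive(encoder, utterances, contrastive_loss)  # hypothetical
    for _ in range(n_rounds):
        # Stage 2: cluster current embeddings to get pseudo speaker labels.
        with torch.no_grad():
            embs = torch.cat([encoder(u) for u in utterances]).cpu().numpy()
        pseudo_labels = KMeans(n_clusters=n_clusters).fit_predict(embs)
        # Stage 3: retrain the encoder as a classifier on the (noisy)
        # pseudo-labels, then re-cluster with the improved model.
        encoder = train_supervised(encoder, utterances, pseudo_labels)  # hypothetical
    return encoder
```

The key design point the citation statements below pick up on is the re-clustering step: each round's pseudo-labels come from the previous round's embeddings, so label quality and embedding quality can improve together.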

Cited by 25 publications (24 citation statements)
References 17 publications (26 reference statements)
“…Relation with iterative clustering. Iterative clustering has been proven to be effective for self-supervised speaker verification models [30,47]. Our model can be viewed as the initial model of iterative clustering; thus, the model can enjoy the benefits of iterative clustering methods.…”
Section: Self-supervised Learning (mentioning)
confidence: 99%
“…The clustering is fixed during training, i.e. we do not re-cluster the chunks while training the extractor, as some self-supervised algorithms do [9,10].…”
Section: Baseline Speaker Diarization (mentioning)
confidence: 99%
“…The emergence of self-supervision methods in deep learning has also been applied to training speaker embedding extractors [8,9,10,11,12]. Several approaches have been examined, some of which employ an audiovisual setting [13,14].…”
Section: Introduction (mentioning)
confidence: 99%
“…Most recently, Kahn et al [22] investigated end-to-end ASR with pseudo-labeling, and Xu et al [34] proposed iterative pseudo-labeling as an extension of it. For speaker recognition, Cai et al [35] proposed an iterative framework with pseudo-labeling to train a speaker embedding network. While these studies focus on utilizing a large amount of unlabeled data, we aim to adapt the model to a target condition using unlabeled data.…”
Section: Related Work (mentioning)
confidence: 99%