ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413351

Contrastive Self-Supervised Learning for Text-Independent Speaker Verification

Abstract: Current speaker verification models rely on supervised training with massive amounts of annotated data, but collecting labeled utterances from multiple speakers is expensive and raises privacy issues. To open up an opportunity for utilizing massive unlabeled utterance data, our work exploits a contrastive self-supervised learning (CSSL) approach for the text-independent speaker verification task. The core principle of CSSL lies in minimizing the distance between the embeddings of augmented segments truncated from the s…
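As a rough illustration of the principle the abstract describes, the following is a minimal sketch of one contrastive training step, not the paper's exact recipe: the encoder, augmentation function, batch layout, and temperature are illustrative assumptions, written in a PyTorch style with an InfoNCE-style objective over in-batch negatives.

    import torch
    import torch.nn.functional as F

    def contrastive_step(encoder, utterances, augment, tau=0.07):
        # Truncate and augment two random segments from every utterance;
        # the two segments of the same utterance form a positive pair.
        seg_a = torch.stack([augment(u) for u in utterances])   # (B, T)
        seg_b = torch.stack([augment(u) for u in utterances])   # (B, T)

        # Speaker embeddings, L2-normalised so dot products are cosine similarities.
        z_a = F.normalize(encoder(seg_a), dim=-1)                # (B, D)
        z_b = F.normalize(encoder(seg_b), dim=-1)                # (B, D)

        # Similarity of every segment in view A to every segment in view B.
        logits = z_a @ z_b.t() / tau                             # (B, B)

        # Diagonal entries are positives (same utterance); off-diagonal entries
        # act as negatives, so a cross-entropy over rows is the contrastive loss.
        targets = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, targets)

In training, the returned loss would be backpropagated through the speaker encoder; at test time only the embeddings are compared for verification.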

Cited by 38 publications (35 citation statements). References 20 publications (18 reference statements).
“…Unsupervised speech processing algorithms, primarily intended as components in larger speech technology applications, and the related evaluation practices form the third category in our taxonomy (Category C in Table 1). These algorithms are often evaluated in terms of how they affect the performance of the larger system they are part of, such as word error rates in automatic speech recognition (e.g., Baevski et al, 2020), accuracy of speaker verification systems (Zhang et al, 2021), performance of language models trained on acoustic speech (Kharitonov et al, 2021), or accuracy in audiovisual retrieval tasks (e.g., Harwath et al, 2016). Another alternative is to use so-called diagnostic classifiers to probe the types of information encoded by the learned representations (e.g., performing speaker or phoneme classification using the learned representations as speech features; e.g., Oord et al, 2018), where the aim is to understand the potential of the method for different speech processing use cases.…”
Section: Reference Point? Pros and Cons (mentioning)
confidence: 99%
“…In order to make full use of a large quantity of unlabeled data, many efforts [5][6][7][8][9][10][11] have been made to obtain good speaker representations in a self-supervised learning manner. Following the iterative framework proposed in [12], the current state-of-the-art self-supervised speaker verification systems usually include two stages.…”
Section: Introduction (mentioning)
confidence: 99%
“…Following the iterative framework proposed in [12], the current state-of-the-art self-supervised speaker verification systems usually include two stages. In stage I, a speaker encoder is trained by contrastive learning based loss [9]. In stage II, estimating pseudo labels from the pre-trained model and then training a new model based on the estimated pseudo labels are iteratively performed to continuously improve the performance.…”
Section: Introduction (mentioning)
confidence: 99%
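The two-stage recipe summarised in the preceding statement can be sketched roughly as follows; the clustering choice (k-means), the helper callables, and the number of rounds are assumptions for illustration, not the cited systems' exact procedure.

    import numpy as np
    from sklearn.cluster import KMeans

    def iterative_pseudo_labeling(model, embed_fn, train_fn, utterances,
                                  n_clusters, n_rounds=3):
        # model: contrastively pre-trained speaker encoder (stage I output).
        # embed_fn(model, utterance) -> fixed-dimensional embedding (np.ndarray).
        # train_fn(utterances, labels) -> new model trained with a supervised loss.
        for _ in range(n_rounds):
            # Embed every utterance with the current model.
            embeddings = np.stack([embed_fn(model, u) for u in utterances])
            # Cluster the embeddings; cluster indices serve as pseudo speaker labels.
            pseudo_labels = KMeans(n_clusters=n_clusters).fit_predict(embeddings)
            # Re-train on the pseudo labels and use the new model in the next round.
            model = train_fn(utterances, pseudo_labels)
        return model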
“…In speaker verification, different self-supervised methods have been proposed as in [7,8,9,10,11,12,13]. Some of these methods use a generative approach [7,10,8,9], i.e., they learn to reconstruct the signal acoustic features from some latent representations.…”
Section: Introduction (mentioning)
confidence: 99%
“…SSL methods based on contrastive loss are also popu-lar [11,12,13] in speaker verification. Contrastive losses intend to make the current sample (anchor) close to the augmented version of the anchor (positive sample) while making the positive sample farther from the negative samples in their embedding space.…”
Section: Introduction (mentioning)
confidence: 99%
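One common instantiation of such a loss (InfoNCE; not necessarily the exact formulation used in [11,12,13]) for an anchor embedding z_a, its positive z_p, and a set of negatives {z_n} is

    L = -\log \frac{\exp(\mathrm{sim}(z_a, z_p)/\tau)}{\exp(\mathrm{sim}(z_a, z_p)/\tau) + \sum_{n} \exp(\mathrm{sim}(z_a, z_n)/\tau)}

where sim(., .) is typically cosine similarity and \tau is a temperature; minimising L pulls the positive toward the anchor while pushing the negatives away in the embedding space.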