ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
DOI: 10.1109/icassp43922.2022.9747526
Self-Supervised Speaker Verification with Simple Siamese Network and Self-Supervised Regularization

Cited by 21 publications (11 citation statements) · References 19 publications
“…For comparison, CEL [24], SimSiamReg [8], C-SimSiam [7], and DINO-Reg [9] were implemented and trained using the optimal parameters suggested by the investigators who proposed the models. However, for DINO-Reg, 3-second and 2-second speech segments were used as the long and short segments, respectively.…”
Section: Results
confidence: 99%
“…As a result, it can be trained on an unlabeled speech dataset and then applied to recognize speakers across various datasets. As in prior works [8, 9, 24], the model is assumed to be trained on a dataset in which each piece of audio contains the speech of only one person. For each audio piece, two random segments are selected to train the LVDNet model in every epoch.…”
Section: Methods
confidence: 99%
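The two-segment sampling described in the quote above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function name `sample_positive_pair` and the fixed `segment_len` are assumptions for the example.

```python
import random

def sample_positive_pair(waveform, segment_len, rng=random):
    """Draw two random (possibly overlapping) segments from one utterance.

    Because each audio piece is assumed to contain a single speaker, the
    two segments share the same (unknown) speaker identity and can serve
    as a positive pair for self-supervised training.
    """
    assert len(waveform) >= segment_len, "utterance shorter than segment"
    max_start = len(waveform) - segment_len
    s1 = rng.randint(0, max_start)  # randint bounds are inclusive
    s2 = rng.randint(0, max_start)
    return (waveform[s1:s1 + segment_len],
            waveform[s2:s2 + segment_len])

# Usage on a toy 16-sample "waveform"; real systems would sample
# fixed-duration windows from raw audio or spectrogram frames.
seg_a, seg_b = sample_positive_pair(list(range(16)), segment_len=4)
```

In the quoted setup this pair is redrawn every epoch, so the model sees a different positive pair per utterance per pass over the data.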
“…The simple contrastive learning (SCL) technique [7], [17] trains the speaker encoder by attracting positive pairs (two augmented segments from the same utterance) and repelling negative pairs (two augmented segments from different utterances). Other works further set additional training targets to improve contrastive efficiency, such as invariance to augmentation [17], invariance to channel [16], equilibrium learning [36], and positive-term regularization [37].…”
Section: B. Self-Supervised Learning of Speaker Encoder
confidence: 99%
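The attract-positive/repel-negative objective described above can be sketched with a SimCLR-style (NT-Xent) loss over speaker embeddings. This is a hedged, pure-Python sketch under assumptions: the function name `scl_loss`, the temperature value, and the use of cosine similarity are illustrative, not the exact formulation of the cited works.

```python
import math

def scl_loss(anchors, positives, temperature=0.1):
    """SimCLR-style contrastive loss on speaker embeddings.

    anchors[i] and positives[i] are embeddings of two augmented segments
    of the same utterance (a positive pair); every positives[j], j != i,
    acts as a negative for anchor i (segments from different utterances).
    """
    def cos(u, v):  # cosine similarity between two vectors
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    loss = 0.0
    for i, a in enumerate(anchors):
        # Similarity of anchor i to every candidate, scaled by temperature.
        logits = [cos(a, p) / temperature for p in positives]
        log_den = math.log(sum(math.exp(l) for l in logits))
        # Cross-entropy with the matching positive as the correct class.
        loss += -(logits[i] - log_den)
    return loss / len(anchors)
```

Minimizing this pulls each anchor toward its matching positive and pushes it away from all other utterances' segments, which is the attract/repel behaviour the quote describes.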