ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022
DOI: 10.1109/icassp43922.2022.9747526
Self-Supervised Speaker Verification with Simple Siamese Network and Self-Supervised Regularization

Cited by 21 publications (11 citation statements) · References 19 publications
“…For comparison, CEL [24], SimSiamReg [8], C-SimSiam [7], and DINO-Reg [9] were implemented and trained using the optimal parameters suggested by the investigators who proposed the models. However, for DINO-Reg, 3-second and 2-second speech segments were used as the long and short segments, respectively.…”
Section: Results
confidence: 99%
“…As a result, it can be trained on an unlabeled speech dataset and then applied to recognize speakers across various datasets. As in prior works [8, 9, 24], the model is assumed to be trained on a dataset in which each piece of audio contains the speech of only one person. For each audio piece, two random segments are selected to train the LVDNet model in every epoch.…”
Section: Methods
confidence: 99%
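The two-segment sampling described in the quote above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function name `sample_positive_pair` and the fixed `segment_len` are assumptions for the example.

```python
import random

def sample_positive_pair(waveform, segment_len, rng=random):
    """Draw two random (possibly overlapping) segments from one utterance.

    Because each audio piece is assumed to contain a single speaker, the
    two segments share the same (unknown) speaker identity and can serve
    as a positive pair for self-supervised training.
    """
    assert len(waveform) >= segment_len, "utterance shorter than segment"
    max_start = len(waveform) - segment_len
    s1 = rng.randint(0, max_start)  # randint bounds are inclusive
    s2 = rng.randint(0, max_start)
    return (waveform[s1:s1 + segment_len],
            waveform[s2:s2 + segment_len])

# Usage on a toy 16-sample "waveform"; real systems would sample
# fixed-duration windows from raw audio or spectrogram frames.
seg_a, seg_b = sample_positive_pair(list(range(16)), segment_len=4)
```

In the quoted setup this pair is redrawn every epoch, so the model sees a different positive pair per utterance per pass over the data.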
“…The simple contrastive learning (SCL) technique [7], [17] trains the speaker encoder by attracting positive pairs (two augmented segments from the same utterance) and repelling negative pairs (two augmented segments from different utterances). Other works further set additional training targets to improve contrastive efficiency, such as invariance to augmentation [17], invariance to channel [16], equilibrium learning [36], and positive-term regularization [37].…”
Section: B. Self-Supervised Learning of Speaker Encoder
confidence: 99%
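The attract-positive/repel-negative objective described above can be sketched with a SimCLR-style (NT-Xent) loss over speaker embeddings. This is a hedged, pure-Python sketch under assumptions: the function name `scl_loss`, the temperature value, and the use of cosine similarity are illustrative, not the exact formulation of the cited works.

```python
import math

def scl_loss(anchors, positives, temperature=0.1):
    """SimCLR-style contrastive loss on speaker embeddings.

    anchors[i] and positives[i] are embeddings of two augmented segments
    of the same utterance (a positive pair); every positives[j], j != i,
    acts as a negative for anchor i (segments from different utterances).
    """
    def cos(u, v):  # cosine similarity between two vectors
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    loss = 0.0
    for i, a in enumerate(anchors):
        # Similarity of anchor i to every candidate, scaled by temperature.
        logits = [cos(a, p) / temperature for p in positives]
        log_den = math.log(sum(math.exp(l) for l in logits))
        # Cross-entropy with the matching positive as the correct class.
        loss += -(logits[i] - log_den)
    return loss / len(anchors)
```

Minimizing this pulls each anchor toward its matching positive and pushes it away from all other utterances' segments, which is the attract/repel behaviour the quote describes.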