ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746952
Fine-Tuning Wav2Vec2 for Speaker Recognition

Abstract: This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with cross-entropy or additive angular softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant…
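The additive angular (margin) softmax loss named in the abstract can be sketched as follows. This is a minimal numpy illustration of the standard AAM-softmax logit adjustment, not the paper's implementation; the margin and scale values are illustrative defaults.

```python
import numpy as np

def aam_softmax_logits(embeddings, weights, labels, margin=0.2, scale=30.0):
    """Additive angular margin (AAM) softmax logits.

    embeddings: (N, D) speaker embeddings; weights: (C, D) class weights;
    labels: (N,) integer speaker ids. Returns (N, C) scaled logits to feed
    into an ordinary cross-entropy loss.
    """
    # L2-normalize both sides so dot products are cosines of the angle
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                  # (N, C) cosine similarities
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    # add the margin to the angle of the target class only
    target = np.zeros_like(cos, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    cos_m = np.where(target, np.cos(theta + margin), cos)
    return scale * cos_m
```

Penalizing the target-class angle forces embeddings of the same speaker to cluster more tightly than plain softmax would.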

Cited by 51 publications (23 citation statements) · References 23 publications
“…Our basic approach to speaker change detection is to finetune a network (pre-)trained for an ASR task where we include speaker (change) information as special labels. Our default workhorse will be Wav2vec2 [6], for which pretrained models and code are readily available, and for which it has been shown that it can be finetuned to carry out various other tasks than ASR, such as emotion [8], language [9] and speaker recognition [10]. We carry out several experiments, demanding progressively more from a special target label that we will denote as SC, for speaker change.…”
Section: Speaker Change Detection Approach
confidence: 99%
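The label-augmentation idea described above (interleaving a special speaker-change symbol into the ASR target sequence) can be sketched as follows. The token name `[SC]` and the per-word speaker ids are hypothetical; the cited work's actual tokenization may differ.

```python
def insert_sc_labels(words, speakers, sc="[SC]"):
    """Interleave a speaker-change token into an ASR word-label sequence.

    words: transcript tokens; speakers: per-word speaker ids (same length).
    Emits the special SC symbol wherever the speaker id changes.
    """
    out = []
    for i, word in enumerate(words):
        if i > 0 and speakers[i] != speakers[i - 1]:
            out.append(sc)
        out.append(word)
    return out
```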
“…In the Speaker Label training condition, we want to use the input vector to the classification layer corresponding to the SC label as embedding for speaker recognition experiments. In [10] the authors found that many pooling strategies for extracting an embedding from the Wav2vec2 vector sequence work, even choosing a random vector (instead of taking the sequence mean) can function well for speaker recognition. We therefore hypothesize that the vector(s) associated with the virtual SC symbol can be used for an embedding related to the speaker following that SC symbol.…”
Section: Extracting Speaker Embeddings
confidence: 99%
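The two pooling strategies contrasted above (the sequence mean versus a single random frame, which [10] reports can still work) can be sketched with numpy stand-ins for the (T, D) wav2vec2 output sequence:

```python
import numpy as np

def mean_pool(frames):
    # frames: (T, D) wav2vec2 output sequence -> (D,) utterance embedding
    return frames.mean(axis=0)

def random_frame_pool(frames, rng):
    # pick one frame at random as the embedding; surprisingly usable per [10]
    return frames[rng.integers(frames.shape[0])]

def cosine_score(a, b):
    # speaker-verification trial score between two pooled embeddings
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```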
“…Most of the works that attain state-of-the-art performance in speaker recognition using Transformer-based pre-trained models utilize them as feature extractors, feeding their features to a TDNN or ECAPA-TDNN [9,10,6]. Other approaches [13,14] fine-tune the pre-trained model using only a very lightweight back-end (i.e. they directly pool the features from the pre-trained model to extract embeddings, which are then fed to a simple linear classifier), but yield inferior performance compared to the former. The main goal of this paper is to explore methods that do not require deep and convolutional architectures and preserve the attention-based nature of the pre-trained network, while attaining state-of-the-art performance.…”
Section: Extracting Speaker Information From Pre-trained Models
confidence: 99%
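The "very lightweight back-end" contrasted above (pool the pre-trained features, then a plain linear classifier, with no TDNN/ECAPA module) can be sketched as follows; the shapes are illustrative, not taken from the cited systems.

```python
import numpy as np

def lightweight_backend_logits(features, W, b):
    """Pool (N, T, D) pre-trained features and apply a linear speaker classifier.

    No deep back-end: a temporal mean pool followed by one affine layer.
    W: (C, D) classifier weights, b: (C,) biases -> (N, C) speaker logits.
    """
    pooled = features.mean(axis=1)   # (N, D) utterance embeddings
    return pooled @ W.T + b
```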
“…Another benefit is that these pre-trained networks are flexible, and can be fine-tuned to a variety of related tasks. This has been shown to be the case for wav2vec2 as well, which, while originally designed for speech recognition [1], has also been used for tasks like speaker recognition [3][4][5] and emotion recognition [5][6][7]. One property of fine-tuning a pre-trained network is that it requires less labeled data than training from scratch.…”
Section: Introduction
confidence: 99%