ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746952
Fine-Tuning Wav2Vec2 for Speaker Recognition

Abstract: This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with cross-entropy or additive angular softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant…
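The additive angular (margin) softmax loss named in the abstract can be sketched as follows. This is a minimal numpy illustration of the standard AAM-softmax logit adjustment, not the paper's implementation; the margin and scale values are illustrative defaults.

```python
import numpy as np

def aam_softmax_logits(embeddings, weights, labels, margin=0.2, scale=30.0):
    """Additive angular margin (AAM) softmax logits.

    embeddings: (N, D) speaker embeddings; weights: (C, D) class weights;
    labels: (N,) integer speaker ids. Returns (N, C) scaled logits to feed
    into an ordinary cross-entropy loss.
    """
    # L2-normalize both sides so dot products are cosines of the angle
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = e @ w.T                                  # (N, C) cosine similarities
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    # add the margin to the angle of the target class only
    target = np.zeros_like(cos, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    cos_m = np.where(target, np.cos(theta + margin), cos)
    return scale * cos_m
```

Penalizing the target-class angle forces embeddings of the same speaker to cluster more tightly than plain softmax would.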

Cited by 51 publications (23 citation statements) · References 23 publications
“…Our basic approach to speaker change detection is to finetune a network (pre-)trained for an ASR task where we include speaker (change) information as special labels. Our default workhorse will be Wav2vec2 [6], for which pretrained models and code are readily available, and for which it has been shown that it can be finetuned to carry out various other tasks than ASR, such as emotion [8], language [9] and speaker recognition [10]. We carry out several experiments, demanding progressively more from a special target label that we will denote as SC, for speaker change.…”
Section: Speaker Change Detection Approach
confidence: 99%
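The label-augmentation idea described above (interleaving a special speaker-change symbol into the ASR target sequence) can be sketched as follows. The token name `[SC]` and the per-word speaker ids are hypothetical; the cited work's actual tokenization may differ.

```python
def insert_sc_labels(words, speakers, sc="[SC]"):
    """Interleave a speaker-change token into an ASR word-label sequence.

    words: transcript tokens; speakers: per-word speaker ids (same length).
    Emits the special SC symbol wherever the speaker id changes.
    """
    out = []
    for i, word in enumerate(words):
        if i > 0 and speakers[i] != speakers[i - 1]:
            out.append(sc)
        out.append(word)
    return out
```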
“…In the Speaker Label training condition, we want to use the input vector to the classification layer corresponding to the SC label as embedding for speaker recognition experiments. In [10] the authors found that many pooling strategies for extracting an embedding from the Wav2vec2 vector sequence work, even choosing a random vector (instead of taking the sequence mean) can function well for speaker recognition. We therefore hypothesize that the vector(s) associated with the virtual SC symbol can be used for an embedding related to the speaker following that SC symbol.…”
Section: Extracting Speaker Embeddings
confidence: 99%
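The two pooling strategies contrasted above (the sequence mean versus a single random frame, which [10] reports can still work) can be sketched with numpy stand-ins for the (T, D) wav2vec2 output sequence:

```python
import numpy as np

def mean_pool(frames):
    # frames: (T, D) wav2vec2 output sequence -> (D,) utterance embedding
    return frames.mean(axis=0)

def random_frame_pool(frames, rng):
    # pick one frame at random as the embedding; surprisingly usable per [10]
    return frames[rng.integers(frames.shape[0])]

def cosine_score(a, b):
    # speaker-verification trial score between two pooled embeddings
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```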
“…Most of the works that attain state-of-the-art performance in speaker recognition using Transformer-based pre-trained models utilize them as feature extractors, feeding their features to a TDNN or ECAPA-TDNN [9,10,6]. Other approaches [13,14] fine-tune the pre-trained model using only a very lightweight back-end (i.e. they directly pool the features from the pre-trained model to extract embeddings, which are then fed to a simple linear classifier), but yield inferior performance compared to the former. The main goal of this paper is to explore methods that do not require deep and convolutional architectures and preserve the attention-based nature of the pre-trained network, while attaining state-of-the-art performance.…”
Section: Extracting Speaker Information From Pre-trained Models
confidence: 99%
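The "very lightweight back-end" contrasted above (pool the pre-trained features, then a plain linear classifier, with no TDNN/ECAPA module) can be sketched as follows; the shapes are illustrative, not taken from the cited systems.

```python
import numpy as np

def lightweight_backend_logits(features, W, b):
    """Pool (N, T, D) pre-trained features and apply a linear speaker classifier.

    No deep back-end: a temporal mean pool followed by one affine layer.
    W: (C, D) classifier weights, b: (C,) biases -> (N, C) speaker logits.
    """
    pooled = features.mean(axis=1)   # (N, D) utterance embeddings
    return pooled @ W.T + b
```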
“…Another benefit is that these pre-trained networks are flexible, and can be fine-tuned to a variety of related tasks. This has been shown to be the case for wav2vec2 as well, which, while originally designed for speech recognition [1], has also been used for tasks like speaker recognition [3][4][5] and emotion recognition [5][6][7]. One property of fine-tuning a pre-trained network is that it requires less labeled data than training from scratch.…”
Section: Introduction
confidence: 99%