2022
DOI: 10.1109/jstsp.2022.3197315

Non-Contrastive Self-Supervised Learning for Utterance-Level Information Extraction From Speech

Abstract: In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representation, e.g., wav2vec, can be used as utterance-level representation with pooling, but the model…
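As a rough illustration of the pooling idea mentioned in the abstract, the sketch below mean-pools frame-level wav2vec 2.0 features into a single utterance-level vector. The HuggingFace `transformers` model name and API are assumptions for illustration, not the authors' setup.

```python
# Minimal sketch: deriving an utterance-level embedding from frame-level
# wav2vec 2.0 features by temporal mean pooling. The model checkpoint and the
# transformers library are illustrative assumptions, not the paper's method.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform = torch.randn(16000 * 3)  # stand-in for a 3-second, 16 kHz utterance

with torch.no_grad():
    inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    frames = model(**inputs).last_hidden_state   # (1, T_frames, 768) frame-level features
    utterance_embedding = frames.mean(dim=1)     # (1, 768) utterance-level vector

# utterance_embedding could then feed a speaker / language / emotion / age classifier.
```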

Cited by 9 publications (5 citation statements) | References 36 publications

“…II. For speaker verification, MCL-DPP achieves an EER of 2.89%, 3.34%, and 6.47% on Vox-O, Vox-E, and Vox-H, respectively, outperforming the best prior work, i.e., Cho et al. [40], by 40.17% on Vox-O. For face verification, it also achieves an EER of 1.74% on Vox-O.…”
Section: Results and Analysis (mentioning)
confidence: 88%
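The equal error rate (EER) quoted above is the operating point where false-acceptance and false-rejection rates coincide. A rough sketch of how it can be computed from verification scores follows; the scores and labels are synthetic placeholders, not results from any of the cited systems.

```python
# Rough sketch: equal error rate (EER) from verification trial scores.
# Scores/labels are synthetic; real evaluations use trial lists such as Vox-O.
import numpy as np

def eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """labels: 1 for target (same-speaker) trials, 0 for non-target trials."""
    order = np.argsort(scores)[::-1]        # sweep thresholds from high to low score
    labels = labels[order]
    tp = np.cumsum(labels)                  # accepted targets at each threshold
    fp = np.cumsum(1 - labels)              # accepted non-targets (false acceptances)
    fnr = 1.0 - tp / labels.sum()           # false-rejection rate
    fpr = fp / (1 - labels).sum()           # false-acceptance rate
    idx = np.argmin(np.abs(fnr - fpr))      # point where the two rates cross
    return float((fnr[idx] + fpr[idx]) / 2)

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
labels = np.array([1,   1,   0,   1,   0,   0  ])
print(f"EER ~ {eer(scores, labels):.2%}")
```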
“…Other comparison-based self-supervised learning techniques include the MOCO framework [38], [39], which stores negative pairs in a memory bank, and the DINO framework [12], [40]-[42], which involves only positive pairs and achieves considerable improvement. For efficiency and effectiveness, we adopt the SCL framework in this study and focus on the sampling strategy of positive pairs.…”
Section: B. Self-Supervised Learning of Speaker Encoder (mentioning)
confidence: 99%
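For readers unfamiliar with the comparison-based objectives this citation refers to, the sketch below shows a generic in-batch contrastive loss over positive pairs (NT-Xent / SimCLR style). It is only an illustration of the idea; it is not the exact MOCO, DINO, or SCL objective of the cited works.

```python
# Generic sketch of an in-batch contrastive loss over positive pairs
# (NT-Xent / SimCLR style). Illustrative only; not the cited works' objectives.
import torch
import torch.nn.functional as F

def ntxent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two augmented views of the same N utterances."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                    # (2N, D) stacked views
    sim = z @ z.t() / temperature                     # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                 # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of the positive
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 192), torch.randn(8, 192)     # dummy speaker embeddings
print(ntxent_loss(z1, z2))
```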
“…These defenses can fall under the following categories: detect and remove the attack vector [28]-[30]; implement non-differentiable functions to obscure gradients [31], [32]; sanitize the attack vector to eliminate adversarial perturbations [33]-[35]; and apply formal verification [36]-[38] or certification techniques [39]-[41] to provide performance guarantees. Defenses applied to training data protect against poisoning attacks by filtering out potentially poisoned data samples [42]-[46]. Defenses within the training algorithm employ robust training techniques, such as adversarial training.…”
Section: Defense Preparation (mentioning)
confidence: 99%
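One of the defense categories listed above, robust training via adversarial training, can be sketched as a single FGSM-style training step on a generic classifier. The model, data, and step size below are placeholders for illustration and are not drawn from the cited survey.

```python
# Illustrative sketch of adversarial training (one FGSM inner step), as an
# example of the "robust training" defense category. Placeholders throughout.
import torch
import torch.nn.functional as F

def adversarial_training_step(model, x, y, optimizer, epsilon=0.01):
    # 1) Craft an FGSM perturbation that increases the classification loss.
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()

    # 2) Update the model on a mix of clean and adversarial examples.
    optimizer.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 1, 28, 28), torch.randint(0, 10, (16,))
print(adversarial_training_step(model, x, y, opt))
```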
“…Recently, motivated by the surge of self-supervised learning concepts, many deep embedding methods [7,8,9,10,11] have proven to be very effective in benefiting from the massive amount of unlabeled data. The code associated with this article is publicly available at https://github.com/theolepage/sslsv.…”
Section: Introduction (mentioning)
confidence: 99%