ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414321
Similarity Analysis of Self-Supervised Speech Representations

Abstract: Self-supervised speech representation learning has recently been a prosperous research topic. Many algorithms have been proposed for learning useful representations from large-scale unlabeled data, and their applications to a wide range of speech tasks have also been investigated. However, there has been little research focusing on understanding the properties of existing approaches. In this work, we aim to provide a comparative study of some of the most representative self-supervised algorithms. Specifically,…

Cited by 26 publications (13 citation statements) | References 25 publications (35 reference statements)
“…They observed that standard design recipes do not translate directly from end-to-end training to self-supervision with the number of filters being a significant factor. Finally, some approaches for post-hoc analysis of the learned representations include studying the similarity of the learned representations to the gold standard that is supervised learning [10,20,25], and understanding the intrinsic dimensionality of the representations [23,71].…”
Section: Analysis Methods For Understanding Self-supervised Approaches
confidence: 99%
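The similarity analyses cited here typically compare layer activations from two models over the same set of frames. The sketch below is a minimal implementation of one widely used measure, linear centered kernel alignment (CKA); the function name and the random toy matrices are illustrative assumptions, not code from the cited papers, which may use other similarity measures.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X: (n_frames, d1) features from one model (e.g., a self-supervised encoder).
    Y: (n_frames, d2) features from another model (e.g., a supervised baseline).
    Returns a similarity score in [0, 1].
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style cross- and self-similarity terms for the linear kernel.
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    self_x = np.linalg.norm(X.T @ X, "fro")
    self_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (self_x * self_y)

# Toy usage: two random "layer activations" over the same 1,000 frames.
rng = np.random.default_rng(0)
ssl_feats = rng.normal(size=(1000, 768))         # e.g., a self-supervised layer
supervised_feats = rng.normal(size=(1000, 512))  # e.g., a supervised ASR layer
print(f"Linear CKA: {linear_cka(ssl_feats, supervised_feats):.3f}")
```

Because linear CKA is invariant to orthogonal transformations and isotropic scaling, it is a common choice when the two models being compared have different feature dimensions.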
“…(9) Masked reconstruction has the highest implicit dimensionality and thus more efficiently makes use of the learned representation space. (10) Utilizing the means and variances of the source dataset normalization on the target dataset results in considerable performance gains.…”
Section: Lessons Learned and Insights Gained
confidence: 99%
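As a concrete reading of these two points, the sketch below quantifies effective dimensionality with a participation ratio and normalizes target-domain features using statistics computed on the source (pre-training) data. Both function names, the participation-ratio choice, and the toy feature matrices are assumptions for illustration; the cited work does not prescribe this exact code or measure.

```python
import numpy as np

def participation_ratio(feats):
    """One common proxy for the effective (implicit) dimensionality of a
    representation: (sum of covariance eigenvalues)^2 / sum of squared eigenvalues."""
    cov = np.cov(feats, rowvar=False)
    eig = np.linalg.eigvalsh(cov)
    return eig.sum() ** 2 / (eig ** 2).sum()

def source_stats_normalize(target_feats, source_feats, eps=1e-8):
    """Normalize target-domain features with the per-dimension mean and variance
    estimated on the source dataset, rather than on the target itself."""
    mu = source_feats.mean(axis=0, keepdims=True)
    sigma = source_feats.std(axis=0, keepdims=True)
    return (target_feats - mu) / (sigma + eps)

# Hypothetical usage with pre-computed (n_frames, feat_dim) feature matrices.
rng = np.random.default_rng(1)
source = rng.normal(loc=0.0, scale=1.0, size=(5000, 80))  # source-domain features
target = rng.normal(loc=0.5, scale=2.0, size=(2000, 80))  # domain-shifted target features
print("effective dimensionality of source features:", participation_ratio(source))
normalized = source_stats_normalize(target, source)
```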
“…A growing number of self-supervised speech models have been proposed. Examples include contrastive predictive coding (CPC) [16,29], auto-regressive predictive coding [30], wav2vec [31], HuBERT [32,33], wav2vec 2.0 [12,34] and WavLM [35], with all showing promising results for a variety of different speech processing tasks. Two particularly popular approaches, HuBERT and wav2vec 2.0, have been applied to automatic speech recognition [12,13], mispronunciation detection [36,37], speaker recognition [38,39] and emotion recognition [40].…”
Section: Related Work
confidence: 99%
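As background for how such pre-trained models are used as feature extractors in the downstream tasks listed above, the sketch below pulls frame-level representations from a wav2vec 2.0 checkpoint. It assumes the Hugging Face transformers library and the facebook/wav2vec2-base checkpoint; neither is prescribed by the cited works, and other checkpoints or toolkits work analogously.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint name on the Hugging Face hub; swap in any wav2vec 2.0 model.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# One second of silent 16 kHz audio stands in for a real utterance here.
waveform = torch.zeros(16000)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states holds one (batch, frames, dim) tensor per layer; these
# frame-level representations are what downstream probes and similarity analyses consume.
print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)
```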
“…One advantage is that a large amount of data can be used even if they do not have target labels. This is a popular topic in the speech processing community, and many models have been proposed: wav2vec [77], wav2vec2 [4], VQ-wav2vec [3], contrastive predictive coding [19], auto-regressive predictive coding [19], and HuBERT [36]. Readers are encouraged to check the related references and other papers in major conferences (e.g., ICASSP and Interspeech special sessions, and the NISP workshop).…”
Section: Front End: DNN-based Self-supervised Training Approach
confidence: 99%