2022
DOI: 10.48550/arxiv.2202.03555
Preprint

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

Abstract: While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard T…
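The masked-view self-distillation described in the abstract can be sketched in a few lines. This is a toy illustration only: the `encode` function, the zero-masking scheme, and the EMA decay `tau` are illustrative assumptions, not the paper's actual Transformer model or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an encoder: one weight matrix per "model". In data2vec the
# teacher weights are an exponentially moving average (EMA) of the student's.
DIM = 16
student_W = rng.normal(size=(DIM, DIM))
teacher_W = student_W.copy()          # teacher initialized from the student

def encode(W, x):
    """Illustrative 'encoder': linear map plus tanh nonlinearity."""
    return np.tanh(x @ W)

x = rng.normal(size=(10, DIM))        # 10 input timesteps / patches / tokens
mask = rng.random(10) < 0.5           # randomly mask about half the positions

# Teacher sees the FULL input and produces the latent targets.
targets = encode(teacher_W, x)

# Student sees a masked view (masked positions zeroed here for simplicity)
# and must predict the teacher's latents at the masked positions.
x_masked = np.where(mask[:, None], 0.0, x)
preds = encode(student_W, x_masked)

# Regression loss only on masked positions (an L1 stand-in for the
# smooth-L1 regression objective).
loss = np.abs(preds[mask] - targets[mask]).mean()

# EMA teacher update with an illustrative decay tau = 0.999.
tau = 0.999
teacher_W = tau * teacher_W + (1 - tau) * student_W
```

The key contrast with contrastive methods is that the target here is a continuous latent representation produced by the teacher, not a discrete token or a negative-sample comparison.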

Cited by 78 publications (129 citation statements)
References 21 publications
“…The learning dynamics of Odin also warrant further investigation, as well as the objective used for representation learning. Recent work has revived interest in masked autoencoding [7,23,36] and masked distillation [6] as viable alternatives to contrastive learning. Odin, by proposing to leverage the learned representations in the design of iteratively more refined self-supervised tasks, is well positioned to benefit them as well.…”
Section: Discussion
confidence: 99%
“…Finally, the recently proposed masked auto-encoder (MAE) [29,56,22,20,2] is a new SSL family. It builds on a reconstruction task that randomly masks image patches and then reconstructs the missing pixels or semantic features via an auto-encoder.…”
Section: Related Work
confidence: 99%
“…These results well testify the high quality, generality and transferability of the learnt features by Mugs. Note that in this work, we evaluate the effectiveness of Mugs through vision transformer (ViT) [23,39], as ViT often achieves better performance than CNN of the same model size [49,39] and also shows great potential to unify vision and language models [28,2].…”
Section: Introduction
confidence: 99%
“…Specifically, we mask spans of latent speech representations in the student model and make the student model predict the masked parts as the output of the teacher model. Inspired by [13], we introduce contextualized representations as the training target, i.e., the average of the top-k normalized latent representations, where we set k = 8 as in [13]. Unlike the self-distillation in [13], we leverage a pre-trained speech model as the teacher.…”
Section: Pre-training Distillation
confidence: 99%
“…Inspired by [13], we introduce contextualized representations as the training target, i.e., the average of the top-k normalized latent representations, where we set k = 8 as in [13]. Unlike the self-distillation in [13], we leverage a pre-trained speech model as the teacher. Formally, given a downsampled audio sequence x, the student is trained to minimize the L1 distance within the masked time steps M as…”
Section: Pre-training Distillation
confidence: 99%
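The target construction quoted above (average of the top-k normalized teacher layers, k = 8, with an L1 loss restricted to masked time steps) can be sketched as follows. The array shapes, the choice of per-layer normalization, and the random mask are illustrative assumptions; the quoted papers use real Transformer activations.

```python
import numpy as np

rng = np.random.default_rng(1)

T, D, LAYERS, K = 20, 32, 12, 8   # timesteps, feature dim, teacher layers, top-k = 8

# Hypothetical teacher activations: one (T, D) array per Transformer layer.
layer_outputs = [rng.normal(size=(T, D)) for _ in range(LAYERS)]

def normalize(h):
    """Normalize a layer's output over the time axis (an illustrative choice
    standing in for the normalization used in the cited setup)."""
    return (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-5)

# Training target: average of the normalized top-k (here, the last 8) layers.
target = np.mean([normalize(h) for h in layer_outputs[-K:]], axis=0)

# Hypothetical student predictions and a random mask M over time steps.
student_pred = rng.normal(size=(T, D))
M = rng.random(T) < 0.5

# L1 distance restricted to the masked time steps, as in the quoted loss.
loss = np.abs(student_pred[M] - target[M]).mean()
```

Averaging several normalized top layers, rather than regressing onto a single layer, gives a smoother target and is reported in [13] to stabilize the latent-regression objective.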