2021
DOI: 10.48550/arxiv.2110.03360
Preprint

Sparse MoEs meet Efficient Ensembles

Abstract: Machine learning models based on the aggregated outputs of submodels, either at the activation or prediction levels, lead to strong performance. We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs). First, we show that the two approaches have complementary features whose combination is beneficial. Then, we present partitioned batch ensembles, an efficient ensemble of sparse MoEs that takes the best of both classes of models. Exte…
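The abstract's core idea, aggregating submodel outputs at the prediction level, can be illustrated with a minimal sketch: average the class probabilities of E independently trained (or independently routed) members. This is illustrative only and is not the paper's partitioned-batch-ensemble implementation; all names here are placeholders.

```python
# Minimal sketch of prediction-level ensembling: softmax each member's
# logits, then average the resulting probabilities across members.
import numpy as np

def ensemble_predict(member_logits):
    """member_logits: array of shape (E, batch, classes)."""
    # Numerically stable softmax over the class axis, per member.
    z = member_logits - member_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Ensemble prediction = mean of member probabilities.
    return probs.mean(axis=0)  # shape (batch, classes)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 2, 10))  # E=4 members, batch of 2, 10 classes
print(ensemble_predict(logits).shape)  # (2, 10)
```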

Cited by 2 publications (2 citation statements, published 2022–2023)
References 11 publications
“…We perform extensive ablation experiments to show the effectiveness of SKDBERT in terms of teacher ensemble, sampling distribution, KD paradigm, extra learning procedure and distillation objective. Appropriately increasing the number of teachers can effectively improve the diversity of predictions (Allingham et al., 2021), yielding better performance. As a result, we discuss the effectiveness of weak teachers (e.g., T01 to T03 for SKDBERT_4, T01 to T06 for SKDBERT_6).…”
Section: Ablation Studies
confidence: 99%
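The statement above invokes "diversity of prediction" among teachers. One common way to quantify this, shown in the hedged sketch below, is the average pairwise disagreement of the teachers' predicted labels; the SKDBERT authors may use a different measure, so treat this as illustrative only.

```python
# Average pairwise disagreement rate across K teachers' hard predictions.
import numpy as np

def pairwise_disagreement(preds):
    """preds: (K, N) array of predicted class labels for N examples."""
    K = preds.shape[0]
    total, pairs = 0.0, 0
    for i in range(K):
        for j in range(i + 1, K):
            total += (preds[i] != preds[j]).mean()  # fraction of examples where i, j differ
            pairs += 1
    return total / pairs

rng = np.random.default_rng(1)
teacher_preds = rng.integers(0, 3, size=(5, 100))  # K=5 teachers, N=100 examples
print(round(pairwise_disagreement(teacher_preds), 3))
```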
“…Key future challenges: quantifying data redundancy shall be investigated in our future study, building on the work of (Birodkar et al., 2019) and (Guo et al., 2021). To improve the heuristic function, recent work on explicit ensembles (Lakshminarayanan et al., 2016; Allingham et al., 2021) shows strong results for uncertainty computation, and (Aghdam et al., 2019) show that adding temporal reasoning can be beneficial for data selection on the object detection task. We aim to further our study by experimenting with different budget sizes while testing on the complete Semantic-KITTI dataset.…”
Section: Model Stability and Effectiveness For Sampling
confidence: 99%
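The statement above cites explicit ensembles as a source of uncertainty estimates for a data-selection heuristic. A minimal sketch of that idea, under the assumption that predictive entropy of the ensemble mean is used as the uncertainty score (the cited works may score differently; all names are placeholders):

```python
# Ensemble predictive entropy as an uncertainty score for sample selection.
import numpy as np

def predictive_entropy(member_probs):
    """member_probs: (E, N, C) class probabilities from E ensemble members."""
    mean_p = member_probs.mean(axis=0)  # (N, C) ensemble prediction
    # Entropy of the averaged distribution; epsilon avoids log(0).
    return -(mean_p * np.log(mean_p + 1e-12)).sum(axis=-1)  # (N,)

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(4), size=(3, 6))  # E=3 members, N=6 samples, C=4 classes
scores = predictive_entropy(p)
selected = np.argsort(scores)[::-1][:2]  # select the 2 most uncertain samples
print(selected)
```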