2022
DOI: 10.1016/j.neunet.2022.06.026

Compressing speaker extraction model with ultra-low precision quantization and knowledge distillation

Cited by 10 publications (4 citation statements)
References 17 publications
“…Impressive results fire up expectations equally high to the quantum world. Data-driven weather and climate predictions apparently beat the best models (Pathak et al, 2022; Bi et al, 2022), and output data can be compressed by three orders of magnitude (Huang and Hoefler 2022). Similar successes are touted in literally any application area.…”
Section: Myth 2: Everything Will Be Deep Learning! (mentioning)
Confidence: 99%
“…Therefore, researchers have begun exploring methods for transferring knowledge from pretrained models to new models in order to speed up the training process and improve their performance. However, current knowledge transfer methods are typically based either on networks of the same size (Yang et al, 2023), such as weight sharing and feature transfer, or on knowledge transfer from deeper to shallower networks (Huang et al, 2022; Shi et al, 2023), such as knowledge distillation and network pruning. There is a lack of knowledge transfer strategies from shallower to deeper networks.…”
Section: Introduction (mentioning)
Confidence: 99%
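For context, the knowledge distillation this excerpt refers to is commonly implemented as a teacher-student loss that blends a softened-output matching term with the ordinary hard-label loss. The following is a minimal sketch of that standard formulation (Hinton et al., 2015) assuming a PyTorch setup; it is illustrative only and not the cited paper's exact objective, and all function and variable names here are assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and the usual hard-label loss."""
    # Soften both output distributions with the temperature, then match them.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_term = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kd_term = kd_term * (temperature ** 2)  # rescale gradients, per Hinton et al.
    # Ordinary cross-entropy against the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term

In this sketch the larger teacher's logits supervise the smaller student, which is the "deeper to shallower" transfer direction the excerpt contrasts with the missing shallower-to-deeper case.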
“…power of scale [7,26]), they can provide warm-start for FL and enable better adaptation to local client distributions. Crucially, while very large PTFs with billions of parameters cannot be deployed in mobile devices, innovations in mobile hardware (equipped with GPU/TPU) [23] and advances in model compression/distillation [19,43,47] will make it possible to deploy smaller, yet equally effective models on clients' devices.…”
Section: Introduction (mentioning)
Confidence: 99%