2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00360

Customizing Student Networks From Heterogeneous Teachers via Adaptive Knowledge Amalgamation

Abstract: A massive number of well-trained deep networks have been released by developers online. These networks may focus on different tasks and in many cases are optimized for different datasets. In this paper, we study how to exploit such heterogeneous pre-trained networks, known as teachers, so as to train a customized student network that tackles a set of selective tasks defined by the user. We assume no human annotations are available, and each teacher may be either single- or multi-task. To this end, we introduce …
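The setting in the abstract, training a student purely from the soft outputs of several pre-trained teachers with no ground-truth labels, can be sketched as a multi-teacher distillation loss. The snippet below is a minimal, hypothetical illustration (the temperature, the per-task logit slicing, and the uniform averaging are assumptions; it is not the paper's adaptive amalgamation scheme):

```python
# Minimal sketch of label-free multi-teacher distillation (hypothetical;
# not the paper's adaptive knowledge-amalgamation method).
import torch
import torch.nn.functional as F

def amalgamation_loss(student_logits, teacher_logits_list, task_slices, T=4.0):
    """Match per-task slices of the student's logits to each teacher's soft outputs.

    student_logits:      (batch, total_classes) raw scores from the student
    teacher_logits_list: one (batch, classes_t) tensor of raw scores per teacher
    task_slices:         one slice per teacher, mapping its classes into the
                         student's output vector
    T:                   softmax temperature
    """
    loss = 0.0
    for t_logits, sl in zip(teacher_logits_list, task_slices):
        s_log_prob = F.log_softmax(student_logits[:, sl] / T, dim=1)
        t_prob = F.softmax(t_logits.detach() / T, dim=1)
        # KL divergence from the (frozen) teacher's distribution to the student's
        loss = loss + F.kl_div(s_log_prob, t_prob, reduction="batchmean") * (T * T)
    return loss / len(teacher_logits_list)
```

Because no annotations are assumed, the only supervision here is the teachers' softened predictions; the user's selective task set determines which slices of the student's output are trained.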

Citations: Cited by 38 publications (25 citation statements)
References: 29 publications
“…In addition to classification tasks [14,39,10], knowledge distillation can also be applied to other tasks such as semantic segmentation [25,17] and depth estimation [29]. Recently, it has also been extended to multitasking [38,34]. By learning from multiple models, the student model can combine knowledge from different tasks to achieve better performance.…”
Section: Data-driven Knowledge Distillation (mentioning)
confidence: 99%
“…Similar limits can be found in recent classifier amalgamation works¹. A few recent works [21][22][23] have been proposed to unify heterogeneous teacher classifiers. Without a predefined dustbin class, [23] requires overlapping classes of objects recognized by the teacher models, otherwise the model fails to find an optimal feature alignment.…”
¹ A detailed comparison is available at https://github.com/zju-vipa/KamalEngine
Section: B. Multi-teacher Knowledge Distillation (mentioning)
confidence: 99%
“…Without a predefined dustbin class, [23] requires overlapping classes of objects recognized by the teacher models, otherwise the model fails to find an optimal feature alignment¹. Both [22] and [23] learn to extract a common feature representation using additional knowledge-amalgamation networks. This causes extra memory use as the number of teachers increases.…”
Section: B. Multi-teacher Knowledge Distillation (mentioning)
confidence: 99%
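The "common feature representation" and the memory cost mentioned in this quote can be illustrated with a small alignment module: the student's intermediate features are projected into each teacher's feature space and regressed against the frozen teacher features, one adapter per teacher. The class below is a hypothetical sketch of that general idea, not the specific amalgamation networks of [22] or [23]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAmalgamator(nn.Module):
    """Hypothetical sketch: one 1x1-conv adapter per teacher projects the
    student's feature map into that teacher's feature space, so a single
    student representation can be regressed against several heterogeneous
    teachers. The per-teacher adapters are what grows with the teacher count."""

    def __init__(self, student_channels, teacher_channels_list):
        super().__init__()
        self.adapters = nn.ModuleList(
            [nn.Conv2d(student_channels, c_t, kernel_size=1)
             for c_t in teacher_channels_list]
        )

    def forward(self, student_feat, teacher_feats):
        # Teachers are frozen, so their feature maps are detached; the loss is
        # a plain L2 regression of projected student features onto each teacher.
        loss = 0.0
        for adapter, t_feat in zip(self.adapters, teacher_feats):
            loss = loss + F.mse_loss(adapter(student_feat), t_feat.detach())
        return loss / len(teacher_feats)
```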
“…Network Binarization. In the field of model compression [64,44,45,5,46], network binarization techniques aim to save memory occupancy and accelerate the network inference by binarizing network parameters and then utilizing bitwise operations [14,15,4]. In recent years, various CNN binarization methods have been proposed, which can be categorized into direct binarization [6,14,15,20] and optimization-based binarization [40,4,30].…”
Section: Related Work (mentioning)
confidence: 99%
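Direct binarization as described in this quote typically replaces full-precision values by their sign in the forward pass while letting gradients pass through (a straight-through estimator), which is what enables bitwise arithmetic at inference time. The snippet below is a generic illustration of that trick under these assumptions; it is not taken from any of the cited binarization methods:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Binarize to {-1, +1} in the forward pass; use a clipped straight-through
    estimator (identity gradient on [-1, 1]) in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Map every value to -1 or +1 (zero goes to +1 to keep the output binary).
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients through only where |x| <= 1, a common STE variant.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

binarize = BinarizeSTE.apply  # e.g. w_bin = binarize(weights) inside a layer
```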