2019
DOI: 10.48550/arxiv.1906.10546
Preprint

Knowledge Amalgamation from Heterogeneous Networks by Common Feature Learning

Cited by 4 publications (7 citation statements)
References 5 publications

“…For example, (Romero et al., 2015; Wang et al., 2018; Shen et al., 2018; Ye et al., 2020b) propose using intermediate feature representations as distillation targets instead of just network outputs, and (Tarvainen & Valpola, 2017; Yang et al., 2018; Zhang et al., 2019a) unify student and teacher network training to reduce computational costs. Knowledge distillation has also been extended to distilling multiple teachers, which is termed Knowledge Amalgamation (Shen et al., 2019a; Luo et al., 2019; Ye et al., 2019; …”
Section: Related Work (mentioning)
confidence: 99%
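The feature-level distillation described in the excerpt above can be illustrated with a minimal PyTorch sketch. It is not code from any of the cited papers; the module and argument names (FeatureDistiller, student_dim, teacher_dims) are assumptions chosen for clarity. The idea shown is that a student's intermediate features are passed through one learned adapter per teacher and regressed onto each teacher's intermediate features.

import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch: match a student's intermediate features to those of
# several teachers through learned linear adapters (one adapter per teacher).
class FeatureDistiller(nn.Module):
    def __init__(self, student_dim, teacher_dims):
        super().__init__()
        self.adapters = nn.ModuleList(
            [nn.Linear(student_dim, d) for d in teacher_dims]
        )

    def forward(self, student_feat, teacher_feats):
        # student_feat: (batch, student_dim); teacher_feats: list of (batch, d_i)
        loss = 0.0
        for adapter, t_feat in zip(self.adapters, teacher_feats):
            loss = loss + F.mse_loss(adapter(student_feat), t_feat.detach())
        return loss / len(teacher_feats)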
“…In particular, to amalgamate intermediate teacher features, [19] develops an encoder-decoder structure. Luo et al [20] adopt common feature learning to project features of all the teachers and the student close to each other. These CNN-based KA approaches share a common strategy that the student requires a fixed-sized hint (generated mostly by projection), which suffers from extra learning burden and loss of information.…”
Section: B. Model Reusing (mentioning)
confidence: 99%
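The encoder-decoder amalgamation attributed to [19] in the excerpt above can be sketched only loosely, since the excerpt does not give the architecture. In the hypothetical PyTorch sketch below (HintAutoencoder and hint_dim are invented names, not the paper's design), concatenated teacher features are compressed into a fixed-size code and reconstructed, and that code plays the role of the hint a student would later mimic.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch: compress concatenated teacher features into a fixed-size
# "hint" and reconstruct them, so the hint retains teacher information.
class HintAutoencoder(nn.Module):
    def __init__(self, teacher_dims, hint_dim):
        super().__init__()
        total = sum(teacher_dims)
        self.encoder = nn.Sequential(nn.Linear(total, hint_dim), nn.ReLU())
        self.decoder = nn.Linear(hint_dim, total)

    def forward(self, teacher_feats):
        # teacher_feats: list of (batch, d_i) tensors from frozen teachers
        cat = torch.cat([t.detach() for t in teacher_feats], dim=1)
        hint = self.encoder(cat)                     # fixed-size code / hint
        recon_loss = F.mse_loss(self.decoder(hint), cat)
        return hint, recon_loss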
“…Furthermore, there has been an increasing interest in knowledge amalgamation (KA), an extension of KD, where knowledge of several teachers is transferred to one multi-talent student [18]- [23]. For example, [18]- [20] focus on training a student with complementary knowledge from homogeneous tasks, e.g., a couple of classification problems. However, these methods share a common strategy: the intermediate student features are required to mimic the aggregated hints (usually achieved by linear projection).…”
Section: Introduction (mentioning)
confidence: 99%
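The "aggregated hint" strategy this excerpt criticizes can be summarized in a short, non-authoritative sketch (PyTorch; AggregatedHintLoss and its dimensions are hypothetical, not taken from the cited methods): teacher features are linearly projected to a single fixed size, and the student's intermediate features are trained to mimic that aggregated hint.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of the aggregated-hint strategy: concatenate teacher
# features, project them to the student's feature size, and make the
# student's intermediate features regress onto the resulting hint.
class AggregatedHintLoss(nn.Module):
    def __init__(self, teacher_dims, student_dim):
        super().__init__()
        self.project = nn.Linear(sum(teacher_dims), student_dim)

    def forward(self, student_feat, teacher_feats):
        cat = torch.cat([t.detach() for t in teacher_feats], dim=1)
        hint = self.project(cat)                 # fixed-size aggregated hint
        return F.mse_loss(student_feat, hint)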
“…The knowledge is distilled via a layer-wise neuron sharing mechanism. CFL [25] distills the knowledge by learning a common feature space, wherein the student model mimics the transformed features of the teachers to aggregate knowledge. Although many such methods are proposed, the models involved are usually limited within grid domain.…”
Section: Related Work (mentioning)
confidence: 99%
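As a rough illustration of the common-feature-space idea this excerpt attributes to CFL [25], the sketch below (PyTorch; CommonFeatureSpace, common_dim, and the plain MSE alignment are simplifying assumptions, not the paper's exact loss) projects both teacher and student features into one shared space and pulls the student's projection toward each teacher's transformed features.

import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of common feature learning: map student and teacher
# features into a shared space and align the student with each teacher.
class CommonFeatureSpace(nn.Module):
    def __init__(self, student_dim, teacher_dims, common_dim):
        super().__init__()
        self.student_proj = nn.Linear(student_dim, common_dim)
        self.teacher_projs = nn.ModuleList(
            [nn.Linear(d, common_dim) for d in teacher_dims]
        )

    def forward(self, student_feat, teacher_feats):
        s = self.student_proj(student_feat)
        align = 0.0
        for proj, t in zip(self.teacher_projs, teacher_feats):
            align = align + F.mse_loss(s, proj(t.detach()))
        return align / len(teacher_feats)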