Entropy Targets for Adaptive Distillation

Líu, Hao; Yan, Haowen; Xia, Jinxiang; Ai, Ying

doi:10.1145/3449301.3449332

Cited by 1 publication

(2 citation statements)

References 6 publications

“…Additionally, models like BERT have shown improved performance with more parameters, leading to gradual increases in model size. Thus, research is being conducted to reduce both the number of parameters and the computational complexity of such models [5][6][7][8].…”

Section: Introduction (mentioning)

confidence: 99%

F-ALBERT: A Distilled Model from a Two-Time Distillation System for Reduced Computational Complexity in ALBERT Model

Kim, Jeong. 2023. Applied Sciences.

Recently, language models based on the Transformer architecture have been predominantly used in AI natural language processing. These models, which have been proven to perform better with more parameters, have led to a significant increase in model size and computational load. ALBERT solves this problem by significantly reducing the number of parameters it retains by repeatedly reusing parameters. Although ALBERT significantly reduces the parameters it maintains, it requires a computational load similar to the original language model due to the reuse process. In this study, we develop a distillation system that decreases the number of times the ALBERT model reuses parameters and progressively reduces the parameters being reused. We propose a representation in this distillation system that can effectively distill the knowledge of the original model and develop a new architecture with reduced computation. Through this system, F-ALBERT, which had about half the computational load compared to the ALBERT model, restored about 98% of the performance of the original model on the GLUE benchmark.

“…Feature Representations Transferring features in a processed rather than simple form can lead to more effective results [8,21,22,23]. Examples include transformation methods such as cosine representation or Euclidean distance.…”

mentioning

confidence: 99%
