Patient Knowledge Distillation for BERT Model Compression
Preprint, 2019
DOI: 10.48550/arxiv.1908.09355

Cited by 94 publications (142 citation statements)
References 0 publications
“…In recent years, many approaches have been proposed for the acceleration of PLMs. One popular way is to distill large-scale PLMs into lightweight models (Sanh et al., 2019; Sun et al., 2019), so that the computation cost is reduced in proportion to the reduction in model size. In addition, efficient Transformer variants have been proposed that reduce the time complexity of self-attention from quadratic to linear (or log-linear).…”
Section: Related Work
confidence: 99%
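As a concrete illustration of the distillation these statements refer to, the sketch below shows the standard soft-target (logit) distillation loss in PyTorch. It is a minimal sketch under assumed settings: the function name, temperature, and weighting `alpha` are illustrative choices, not the cited papers' exact recipe.

```python
# Minimal sketch of logit-based (soft-target) knowledge distillation.
# Hyperparameters and the function name are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend cross-entropy on hard labels with a KL term that pushes the
    student toward the teacher's temperature-softened predictions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so gradient magnitudes stay comparable
    # across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

In BERT-PKD, a soft-target term of this kind is combined with an additional loss on intermediate representations (sketched after the next citation statement).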
“…Transformer-based pre-trained language models like BERT (Devlin et al., 2018) exhibit excellent performance but are also computationally expensive. There have been many works that attempt to compress Transformer-based models with knowledge distillation (Jiao et al., 2019; Sun et al., 2019a; Wang et al., 2020). The distilled knowledge may be soft target probabilities, embedding outputs, hidden representations, or attention weight distributions.…”
Section: Related Work
confidence: 99%
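The statement above lists the forms that distilled knowledge can take. The sketch below illustrates one of them, matching intermediate hidden representations in the spirit of BERT-PKD's "patient" loss on [CLS] hidden states; the layer mapping, input shapes, and function name are assumptions for illustration, not the paper's actual code.

```python
# Rough sketch of a patient-style loss: each chosen student layer imitates
# the normalized [CLS] hidden state of a chosen teacher layer (e.g., every
# other teacher layer, as in the "skip" mapping). Shapes are assumptions.
import torch
import torch.nn.functional as F

def patient_loss(student_hidden, teacher_hidden, teacher_layer_ids):
    """student_hidden: list of [batch, hidden] CLS vectors, one per student layer.
    teacher_hidden: list of [batch, hidden] CLS vectors, one per teacher layer.
    teacher_layer_ids: which teacher layer each student layer should imitate."""
    loss = 0.0
    for s_vec, t_idx in zip(student_hidden, teacher_layer_ids):
        t_vec = teacher_hidden[t_idx]
        # Normalize so the MSE compares directions rather than magnitudes.
        loss = loss + F.mse_loss(F.normalize(s_vec, dim=-1),
                                 F.normalize(t_vec, dim=-1))
    return loss / len(student_hidden)
```

Attention-based distillation (as in the cited Jiao et al. and Wang et al. work) follows the same pattern, but matches attention weight distributions instead of hidden vectors.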
“…Task-specific methods require that the teacher be trained for each downstream task. Distilled bidirectional long short-term memory network (Distilled BiLSTM) (Tang et al., 2019), Patient Knowledge Distillation for a BERT model (BERT-PKD) (Sun et al., 2019), and Stacked Internal Distillation (SID) (Aguilar et al., 2020) are all considered task-specific methods. Task-agnostic methods, on the other hand, use one teacher for several downstream tasks.…”
Section: Knowledge Distillation
confidence: 99%