Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.195

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

Abstract: Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine…
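As a concrete illustration of the task-agnostic claim, the sketch below fine-tunes a pretrained MobileBERT checkpoint on a toy binary classification task. It assumes the Hugging Face transformers library and the google/mobilebert-uncased checkpoint; the example texts, label set, and learning rate are illustrative choices, not settings from the paper.

# Hedged sketch: fine-tuning a pretrained MobileBERT checkpoint on a toy
# binary sentence-classification task (assumes `transformers` and the
# `google/mobilebert-uncased` checkpoint; data below is illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/mobilebert-uncased", num_labels=2
)

texts = ["a delightful read", "utterly forgettable"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
# Loss comes from the classification head added on top of the task-agnostic encoder.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()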

Cited by 412 publications (384 citation statements)
References 53 publications
“…In NLP, prior work has explored distilling larger BERT-like models into smaller ones. Most of this work trains the student network to mimic a teacher that has already been fine-tuned for a specific task, i.e., task-specific distillation (Tsai et al., 2019; Turc et al., 2019; Sun et al., 2020). Recently, Sanh et al. (2020) showed that it is also possible to distill BERT-like models in a task-agnostic way by training the student to mimic the teacher's outputs and activations on the pretraining objective, i.e., masked language modeling (MLM).…”
Section: Distillation Technique
confidence: 99%
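The quoted distinction between task-specific and task-agnostic distillation can be made concrete with a minimal sketch of the task-agnostic losses: the student matches the teacher's MLM output distribution and, optionally, its hidden activations. This is a generic PyTorch sketch, not the exact recipe of Sanh et al. (2020) or of MobileBERT; the tensor shapes, temperature, and equal loss weighting are assumptions.

# Hedged sketch of task-agnostic distillation on the MLM objective: the student
# matches the teacher's output distribution over masked tokens (KL with
# temperature) plus, optionally, the teacher's hidden activations.
import torch
import torch.nn.functional as F

def mlm_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-softened teacher and student distributions.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def hidden_state_loss(student_hidden, teacher_hidden):
    # MSE between student and teacher activations (assumes matching widths).
    return F.mse_loss(student_hidden, teacher_hidden)

# Toy tensors standing in for model outputs at masked positions:
vocab, hidden = 30522, 512
student_logits = torch.randn(8, vocab, requires_grad=True)
teacher_logits = torch.randn(8, vocab)   # teacher is frozen, no gradients needed
student_h = torch.randn(8, hidden, requires_grad=True)
teacher_h = torch.randn(8, hidden)

loss = mlm_distillation_loss(student_logits, teacher_logits) \
       + hidden_state_loss(student_h, teacher_h)
loss.backward()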
“…The knowledge distillation approach enables the transfer of knowledge from a large teacher model to a smaller student model. Such attempts have been made to distill BERT models, e.g., DistilBERT (Sanh et al., 2019), BERT-PKD (Sun et al., 2019), Distilled BiLSTM (Tang et al., 2019), TinyBERT (Jiao et al., 2019), MobileBERT (Sun et al., 2020), etc. All of these methods require carefully designing the student architecture.…”
Section: Pre-trained Language Model Compression
confidence: 99%
“…However, these models often consume considerable storage, memory bandwidth, and computational resources. To reduce the model size and increase the inference throughput, compression techniques such as knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Tang et al., 2019; Jiao et al., 2019; Sun et al., 2020) … [figure caption: comparison of knowledge distillation methods (DistilBERT (Sanh et al., 2019) and BERT-PKD (Sun et al., 2019)) and iterative pruning methods (Iterative Pruning (Guo et al., 2019) and our proposed method) in terms of accuracy at various compression rates using the MNLI test set]. Knowledge distillation methods require re-distillation from the teacher to get each single data point, whereas iterative pruning methods can produce continuous curves at once.…”
Section: Introduction
confidence: 99%
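The contrast drawn in this excerpt, one point per distillation run versus a continuous curve from pruning, can be sketched as an iterative magnitude-pruning loop. The toy model, pruning fraction, and the evaluate stub are assumptions; a real setup would prune a fine-tuned BERT-style model and measure MNLI accuracy after each round.

# Hedged sketch of iterative magnitude pruning, which yields one
# (compression rate, accuracy) point per round without re-distilling from a teacher.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 3))

def evaluate(m):
    # Stand-in for dev-set accuracy (e.g., MNLI); always 0.0 in this toy sketch.
    return 0.0

def sparsity(m):
    weights = [mod.weight for mod in m.modules() if isinstance(mod, nn.Linear)]
    total = sum(w.numel() for w in weights)
    zeros = sum(int((w == 0).sum()) for w in weights)
    return zeros / total

curve = []
for step in range(5):
    for mod in model.modules():
        if isinstance(mod, nn.Linear):
            # Prune 20% of the remaining (not yet pruned) weights by L1 magnitude.
            prune.l1_unstructured(mod, name="weight", amount=0.2)
    # ...a few epochs of fine-tuning would normally go here...
    curve.append((sparsity(model), evaluate(model)))
print(curve)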
“…Furthermore, we want to apply different techniques like quantization and distillation to make the models available in the browser. Moreover, we would like to focus on light models like MobileBERT (Sun et al., 2020), retrain it for Arabic and make it readily usable in the browser.…”
Section: Conclusion and Future Plans
confidence: 99%
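As a rough sketch of the quantization step mentioned here, the snippet below applies post-training dynamic quantization to a toy PyTorch model and compares serialized sizes. The model and layer choice are assumptions, and browser deployment would additionally require an export step (e.g., to ONNX or TensorFlow.js), which is not shown.

# Hedged sketch of post-training dynamic quantization: Linear layers are
# converted to int8, shrinking the serialized model for lightweight deployment.
import io
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2))
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m):
    # Serialize the state dict in memory to measure its size.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {size_mb(model):.2f} MB -> int8: {size_mb(quantized):.2f} MB")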