Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.195

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices

Abstract: Natural Language Processing (NLP) has recently achieved great success by using huge pre-trained models with hundreds of millions of parameters. However, these models suffer from heavy model sizes and high latency such that they cannot be deployed to resource-limited mobile devices. In this paper, we propose MobileBERT for compressing and accelerating the popular BERT model. Like the original BERT, MobileBERT is task-agnostic, that is, it can be generically applied to various downstream NLP tasks via simple fine…
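As a concrete illustration of the task-agnostic claim, the sketch below fine-tunes a pretrained MobileBERT checkpoint on a toy binary classification task. It assumes the Hugging Face transformers library and the google/mobilebert-uncased checkpoint; the example texts, label set, and learning rate are illustrative choices, not settings from the paper.

# Hedged sketch: fine-tuning a pretrained MobileBERT checkpoint on a toy
# binary sentence-classification task (assumes `transformers` and the
# `google/mobilebert-uncased` checkpoint; data below is illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/mobilebert-uncased", num_labels=2
)

texts = ["a delightful read", "utterly forgettable"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
# Loss comes from the classification head added on top of the task-agnostic encoder.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()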

Cited by 412 publications (384 citation statements)
References 53 publications
“…In NLP, prior work has explored distilling larger BERT-like models into smaller ones. Most of this work trains the student network to mimic a teacher that has already been fine-tuned for a specific task, i.e., task-specific distillation (Tsai et al., 2019; Turc et al., 2019; Sun et al., 2020). Recently, Sanh et al. (2020) showed that it is also possible to distill BERT-like models in a task-agnostic way by training the student to mimic the teacher's outputs and activations on the pretraining objective, i.e., masked language modeling (MLM).…”
Section: Distillation Technique
confidence: 99%
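The quoted distinction between task-specific and task-agnostic distillation can be made concrete with a minimal sketch of the task-agnostic losses: the student matches the teacher's MLM output distribution and, optionally, its hidden activations. This is a generic PyTorch sketch, not the exact recipe of Sanh et al. (2020) or of MobileBERT; the tensor shapes, temperature, and equal loss weighting are assumptions.

# Hedged sketch of task-agnostic distillation on the MLM objective: the student
# matches the teacher's output distribution over masked tokens (KL with
# temperature) plus, optionally, the teacher's hidden activations.
import torch
import torch.nn.functional as F

def mlm_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-softened teacher and student distributions.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def hidden_state_loss(student_hidden, teacher_hidden):
    # MSE between student and teacher activations (assumes matching widths).
    return F.mse_loss(student_hidden, teacher_hidden)

# Toy tensors standing in for model outputs at masked positions:
vocab, hidden = 30522, 512
student_logits = torch.randn(8, vocab, requires_grad=True)
teacher_logits = torch.randn(8, vocab)   # teacher is frozen, no gradients needed
student_h = torch.randn(8, hidden, requires_grad=True)
teacher_h = torch.randn(8, hidden)

loss = mlm_distillation_loss(student_logits, teacher_logits) \
       + hidden_state_loss(student_h, teacher_h)
loss.backward()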
“…The knowledge distillation approach enables the transfer of knowledge from a large teacher model to a smaller student model. Such attempts have been made to distill BERT models, e.g., DistilBERT (Sanh et al., 2019), BERT-PKD (Sun et al., 2019), Distilled BiLSTM (Tang et al., 2019), TinyBERT (Jiao et al., 2019), MobileBERT (Sun et al., 2020), etc. All of these methods require carefully designing the student architecture.…”
Section: Pre-trained Language Model Compression
confidence: 99%
“…However, these models often consume considerable storage, memory bandwidth, and computational resources. To reduce the model size and increase the inference throughput, compression techniques such as knowledge distillation (Sanh et al., 2019; Sun et al., 2019; Tang et al., 2019; Jiao et al., 2019; Sun et al., 2020) … [figure caption: comparison of knowledge distillation methods (DistilBERT (Sanh et al., 2019) and BERT-PKD (Sun et al., 2019)) and iterative pruning methods (Iterative Pruning (Guo et al., 2019) and our proposed method) in terms of accuracy at various compression rates using the MNLI test set]. Knowledge distillation methods require re-distillation from the teacher to get each single data point, whereas iterative pruning methods can produce continuous curves at once.…”
Section: Introduction
confidence: 99%
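The contrast drawn in this excerpt, one point per distillation run versus a continuous curve from pruning, can be sketched as an iterative magnitude-pruning loop. The toy model, pruning fraction, and the evaluate stub are assumptions; a real setup would prune a fine-tuned BERT-style model and measure MNLI accuracy after each round.

# Hedged sketch of iterative magnitude pruning, which yields one
# (compression rate, accuracy) point per round without re-distilling from a teacher.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 3))

def evaluate(m):
    # Stand-in for dev-set accuracy (e.g., MNLI); always 0.0 in this toy sketch.
    return 0.0

def sparsity(m):
    weights = [mod.weight for mod in m.modules() if isinstance(mod, nn.Linear)]
    total = sum(w.numel() for w in weights)
    zeros = sum(int((w == 0).sum()) for w in weights)
    return zeros / total

curve = []
for step in range(5):
    for mod in model.modules():
        if isinstance(mod, nn.Linear):
            # Prune 20% of the remaining (not yet pruned) weights by L1 magnitude.
            prune.l1_unstructured(mod, name="weight", amount=0.2)
    # ...a few epochs of fine-tuning would normally go here...
    curve.append((sparsity(model), evaluate(model)))
print(curve)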
“…Furthermore, we want to apply different techniques like quantization and distillation to make the models available in the browser. Moreover, we would like to focus on light models like MobileBERT (Sun et al., 2020), retrain it for Arabic and make it readily usable in the browser.…”
Section: Conclusion and Future Plans
confidence: 99%
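As a rough sketch of the quantization step mentioned here, the snippet below applies post-training dynamic quantization to a toy PyTorch model and compares serialized sizes. The model and layer choice are assumptions, and browser deployment would additionally require an export step (e.g., to ONNX or TensorFlow.js), which is not shown.

# Hedged sketch of post-training dynamic quantization: Linear layers are
# converted to int8, shrinking the serialized model for lightweight deployment.
import io
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2))
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m):
    # Serialize the state dict in memory to measure its size.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32: {size_mb(model):.2f} MB -> int8: {size_mb(quantized):.2f} MB")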