MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
Preprint, 2020
DOI: 10.48550/arxiv.2004.02984

Cited by 92 publications (131 citation statements)
References 23 publications
“…In effect, one teacher can train multiple students. Methods such as DistilBERT (Sanh et al, 2019), TinyBERT (Jiao et al, 2019) and MobileBERT (Sun et al, 2020) are task-specific methods. Note that process of task-agnostic BERT distillation is computationally expensive (McCarley et al, 2019) because the corpus used in the distillation is sizable and for each training step a forward process of teacher model and a forward-backward process of student model should be performed.…”
Section: Knowledge Distillation (mentioning)
confidence: 99%
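The per-step cost described above, a teacher forward pass plus a student forward-backward pass, can be made concrete with a minimal sketch of one distillation step. It assumes generic PyTorch modules teacher and student that return logits when called on a batch; the call signature, temperature, and soft-label KL loss are illustrative choices for this sketch, not taken from any of the cited papers.

import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, input_ids, attention_mask, T=2.0):
    """One step of logit distillation: the teacher runs forward only,
    the student runs forward and backward."""
    with torch.no_grad():                      # teacher forward pass, no gradients kept
        teacher_logits = teacher(input_ids, attention_mask=attention_mask)

    student_logits = student(input_ids, attention_mask=attention_mask)

    # Soften both distributions with temperature T and match them with KL divergence.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    optimizer.zero_grad()
    loss.backward()                            # student backward pass
    optimizer.step()
    return loss.item()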
“…To this end, Tiny-BERT, MobileBERT, and SID all try to improve BERT-PKD by distilling more internal representations to the student, such as embedding layers and attention weights. TinyBERT (Jiao et al, 2019) and MobileBERT (Sun et al, 2020) are small student models distilled from larger pretrained transformers and can achieve good GLUE (Wang et al, 2018) scores. However, these models require spending substantial compute to pre-train the larger teacher model.…”
Section: The Larger the Teacher, the Better (mentioning)
confidence: 99%
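To illustrate what distilling "more internal representations" can look like, the sketch below adds mean-squared-error terms on hidden states and attention maps on top of whatever output loss is used. The .hidden_states and .attentions attributes, the layer_map pairing, and the alpha/beta weights are assumptions made for this sketch, not the exact TinyBERT or MobileBERT objectives.

import torch.nn.functional as F

def intermediate_distill_loss(student_out, teacher_out, layer_map, alpha=1.0, beta=1.0):
    """Match student internals to the teacher's hidden states and attention maps.

    student_out / teacher_out are assumed to expose .hidden_states and .attentions
    (tuples of tensors), as HuggingFace-style models do when called with
    output_hidden_states=True and output_attentions=True. layer_map pairs each
    student transformer layer (1-based) with the teacher layer it should imitate.
    """
    loss = 0.0
    for s_layer, t_layer in layer_map:
        # hidden_states[0] is the embedding output, so layer l sits at index l
        # (equal hidden sizes are assumed; otherwise a learned projection is needed).
        loss = loss + alpha * F.mse_loss(student_out.hidden_states[s_layer],
                                         teacher_out.hidden_states[t_layer])
        # attentions[l - 1] holds the attention maps of layer l.
        loss = loss + beta * F.mse_loss(student_out.attentions[s_layer - 1],
                                        teacher_out.attentions[t_layer - 1])
    return loss

For example, a hypothetical 4-layer student imitating a 12-layer teacher might use layer_map = [(1, 3), (2, 6), (3, 9), (4, 12)] and add this term to the usual logit-distillation loss.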
“…But these models are not suitable for devices where memory footprint and response time are constrained. Recently there has been ongoing research to reduce the memory footprint of BERT [20], but still, the model size is around 100 MB and hence we do not consider the same for our evaluation. In this paper, we present a memory-efficient emoji prediction model that requires zero network usage and is completely on-device, preventing any data privacy issues.…”
Section: B. Emoji Prediction (mentioning)
confidence: 99%
“…Although the size of BERT family models is usually smaller than the GPT family, compressing the BERT family has been investigated much more in the literature (e.g. DistilBERT (Sanh et al, 2019), TinyBERT (Jiao et al, 2019), MobileBERT (Sun et al, 2020), ALP-KD (Passban et al, 2021), MATE-KD (Rashid et al, 2021), Annealing-KD (Jafari et al, 2021) and BERTQuant (Zhang et al, 2020)). On the other hand, to the best of our knowledge, the GPT family has barely a handful of compressed models, among them the DistilGPT2 1 model is very prominent.…”
Section: Introduction (mentioning)
confidence: 99%