MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices
Preprint, 2020
DOI: 10.48550/arxiv.2004.02984

Cited by 92 publications (131 citation statements)
References 23 publications
“…In effect, one teacher can train multiple students. Methods such as DistilBERT (Sanh et al, 2019), TinyBERT (Jiao et al, 2019) and MobileBERT (Sun et al, 2020) are task-specific methods. Note that process of task-agnostic BERT distillation is computationally expensive (McCarley et al, 2019) because the corpus used in the distillation is sizable and for each training step a forward process of teacher model and a forward-backward process of student model should be performed.…”
Section: Knowledge Distillation (mentioning)
confidence: 99%
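The per-step cost described above, a teacher forward pass plus a student forward-backward pass, can be made concrete with a minimal sketch of one distillation step. It assumes generic PyTorch modules teacher and student that return logits when called on a batch; the call signature, temperature, and soft-label KL loss are illustrative choices for this sketch, not taken from any of the cited papers.

import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, input_ids, attention_mask, T=2.0):
    """One step of logit distillation: the teacher runs forward only,
    the student runs forward and backward."""
    with torch.no_grad():                      # teacher forward pass, no gradients kept
        teacher_logits = teacher(input_ids, attention_mask=attention_mask)

    student_logits = student(input_ids, attention_mask=attention_mask)

    # Soften both distributions with temperature T and match them with KL divergence.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    optimizer.zero_grad()
    loss.backward()                            # student backward pass
    optimizer.step()
    return loss.item()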
“…To this end, Tiny-BERT, MobileBERT, and SID all try to improve BERT-PKD by distilling more internal representations to the student, such as embedding layers and attention weights. TinyBERT (Jiao et al, 2019) and MobileBERT (Sun et al, 2020) are small student models distilled from larger pretrained transformers and can achieve good GLUE (Wang et al, 2018) scores. However, these models require spending substantial compute to pre-train the larger teacher model.…”
Section: The Larger the Teacher, the Better (mentioning)
confidence: 99%
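To illustrate what distilling "more internal representations" can look like, the sketch below adds mean-squared-error terms on hidden states and attention maps on top of whatever output loss is used. The .hidden_states and .attentions attributes, the layer_map pairing, and the alpha/beta weights are assumptions made for this sketch, not the exact TinyBERT or MobileBERT objectives.

import torch.nn.functional as F

def intermediate_distill_loss(student_out, teacher_out, layer_map, alpha=1.0, beta=1.0):
    """Match student internals to the teacher's hidden states and attention maps.

    student_out / teacher_out are assumed to expose .hidden_states and .attentions
    (tuples of tensors), as HuggingFace-style models do when called with
    output_hidden_states=True and output_attentions=True. layer_map pairs each
    student transformer layer (1-based) with the teacher layer it should imitate.
    """
    loss = 0.0
    for s_layer, t_layer in layer_map:
        # hidden_states[0] is the embedding output, so layer l sits at index l
        # (equal hidden sizes are assumed; otherwise a learned projection is needed).
        loss = loss + alpha * F.mse_loss(student_out.hidden_states[s_layer],
                                         teacher_out.hidden_states[t_layer])
        # attentions[l - 1] holds the attention maps of layer l.
        loss = loss + beta * F.mse_loss(student_out.attentions[s_layer - 1],
                                        teacher_out.attentions[t_layer - 1])
    return loss

For example, a hypothetical 4-layer student imitating a 12-layer teacher might use layer_map = [(1, 3), (2, 6), (3, 9), (4, 12)] and add this term to the usual logit-distillation loss.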
“…But these models are not suitable for devices where memory footprint and response time are constrained. Recently there has been ongoing research to reduce the memory footprint of BERT [20], but still, the model size is around 100 MB and hence we do not consider the same for our evaluation. In this paper, we present a memory-efficient emoji prediction model that requires zero network usage and is completely on-device, preventing any data privacy issues.…”
Section: B. Emoji Prediction (mentioning)
confidence: 99%
“…Although the size of BERT family models is usually smaller than the GPT family, compressing the BERT family has been investigated much more in the literature (e.g. DistilBERT (Sanh et al, 2019), TinyBERT (Jiao et al, 2019), MobileBERT (Sun et al, 2020), ALP-KD (Passban et al, 2021), MATE-KD (Rashid et al, 2021), Annealing-KD (Jafari et al, 2021) and BERTQuant (Zhang et al, 2020)). On the other hand, to the best of our knowledge, the GPT family has barely a handful of compressed models, among them the DistilGPT2 1 model is very prominent.…”
Section: Introduction (mentioning)
confidence: 99%