Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1374

Small and Practical BERT Models for Sequence Labeling

Abstract: We propose a practical scheme to train a single multilingual sequence labeling model that yields state-of-the-art results and is small and fast enough to run on a single CPU. Starting from a public multilingual BERT checkpoint, our final model is 6x smaller and 27x faster, and has higher accuracy than a state-of-the-art multilingual baseline. We show that our model especially outperforms on low-resource languages, and works on code-mixed input text without being explicitly trained on code-mixed examples. We showc…

Cited by 106 publications (75 citation statements)
References 9 publications
“…In NLP, prior work has explored distilling larger BERT-like models into smaller ones. Most of this work trains the student network to mimic a teacher that has already been finetuned for a specific task, i.e., task-specific distillation (Tsai et al., 2019; Turc et al., 2019; Sun et al., 2020). Recently, Sanh et al. (2020) showed that it is also possible to distill BERT-like models in a task-agnostic way by training the student to mimic the teacher's outputs and activations on the pretraining objective, i.e., masked language modeling (MLM).…”
Section: Distillation Technique
Citation type: mentioning (confidence: 99%)
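The task-specific distillation described in this excerpt boils down to training the student on the teacher's softened output distribution alongside the gold labels. Below is a minimal, hypothetical PyTorch sketch of such a blended objective for a tagging task; the temperature T, the weight alpha, and the function name are illustrative choices, not the exact recipe used by any of the cited papers.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term against the teacher with the usual hard-label loss.

    student_logits, teacher_logits: (batch, seq_len, num_tags)
    labels: (batch, seq_len) gold tag ids
    T: softmax temperature; alpha: weight on the soft-target term (both illustrative).
    """
    # Soft targets: the student matches the teacher's tempered distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the gold tags.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1.0 - alpha) * hard
```

In the task-specific setting, the teacher here would already be fine-tuned for the labeling task; in the task-agnostic setting described by Sanh et al., the same idea is applied to the MLM pretraining objective instead.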
“…One exciting way to compensate for the lack of unlabeled data in low-resource language varieties is to finetune a large, multilingual language model that has been pretrained on the union of many languages' data (Lample and Conneau, 2019). This enables the model to transfer some of what it learns from high-resource languages to low-resource ones, demonstrating benefits over monolingual methods in some cases (Conneau et al., 2020a; Tsai et al., 2019), though not always (Agerri et al., 2020; Rönnqvist et al., 2019).…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
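As a concrete, heavily simplified illustration of the fine-tuning setup this excerpt refers to, the sketch below loads the public multilingual BERT checkpoint with the Hugging Face transformers library and attaches a token-classification head. The label set, example sentence, and dummy labels are placeholders for illustration only, not the data or configuration used in the cited work.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]   # illustrative tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)
model.train()

# One toy step: any language seen in pretraining can be fed to the same model.
enc = tokenizer("Bonjour Montréal", return_tensors="pt")
dummy_labels = torch.zeros_like(enc["input_ids"])    # placeholder gold tags
out = model(**enc, labels=dummy_labels)
out.loss.backward()                                  # gradients for one update
```

Because the encoder is shared across all pretraining languages, supervision in high-resource languages can improve tagging in low-resource ones, which is the transfer effect the excerpt describes.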
“…Regarding multilingual transformers, Tsai et al. (2019) evaluated distilled versions of BERT and mBERT for POS tagging and Morphology tasks. Their version of mBERT is 6 times smaller and 27 times faster but induces an average F1 drop of 1.6% and 5.4% in the two evaluated tasks.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
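The size and speed figures quoted here (6x smaller, 27x faster, runnable on a single CPU) are the kind of numbers obtained by counting parameters and timing single-threaded inference. The sketch below shows one rough way to take such measurements in PyTorch; the stand-in encoder, sequence length, and thread setting are arbitrary assumptions, not the authors' benchmark protocol.

```python
import time
import torch
import torch.nn as nn

# Stand-in student model: a small Transformer encoder, not the paper's architecture.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=3,
).eval()

num_params = sum(p.numel() for p in model.parameters())
x = torch.randn(1, 128, 256)          # one sequence of 128 token embeddings
torch.set_num_threads(1)              # mimic single-CPU inference
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(20):
        model(x)
    latency_ms = (time.perf_counter() - start) / 20 * 1000

print(f"{num_params / 1e6:.1f}M parameters, {latency_ms:.1f} ms per sequence")
```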