Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1374

Small and Practical BERT Models for Sequence Labeling

Abstract: We propose a practical scheme to train a single multilingual sequence labeling model that yields state-of-the-art results and is small and fast enough to run on a single CPU. Starting from a public multilingual BERT checkpoint, our final model is 6x smaller and 27x faster, and has higher accuracy than a state-of-the-art multilingual baseline. We show that our model especially outperforms on low-resource languages, and works on code-mixed input text without being explicitly trained on code-mixed examples. We showc…

Cited by 106 publications (75 citation statements)
References 9 publications
“…In NLP, prior work has explored distilling larger BERT-like models into smaller ones. Most of this work trains the student network to mimic a teacher that has already been finetuned for a specific task, i.e., task-specific distillation (Tsai et al., 2019; Turc et al., 2019; Sun et al., 2020). Recently, Sanh et al. (2020) showed that it is also possible to distill BERT-like models in a task-agnostic way by training the student to mimic the teacher's outputs and activations on the pretraining objective, i.e., masked language modeling (MLM).…”
Section: Distillation Technique
Citation type: mentioning (confidence: 99%)
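The task-specific distillation described in this excerpt boils down to training the student on the teacher's softened output distribution alongside the gold labels. Below is a minimal, hypothetical PyTorch sketch of such a blended objective for a tagging task; the temperature T, the weight alpha, and the function name are illustrative choices, not the exact recipe used by any of the cited papers.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-target KL term against the teacher with the usual hard-label loss.

    student_logits, teacher_logits: (batch, seq_len, num_tags)
    labels: (batch, seq_len) gold tag ids
    T: softmax temperature; alpha: weight on the soft-target term (both illustrative).
    """
    # Soft targets: the student matches the teacher's tempered distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the gold tags.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1.0 - alpha) * hard
```

In the task-specific setting, the teacher here would already be fine-tuned for the labeling task; in the task-agnostic setting described by Sanh et al., the same idea is applied to the MLM pretraining objective instead.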
“…One exciting way to compensate for the lack of unlabeled data in low-resource language varieties is to finetune a large, multilingual language model that has been pretrained on the union of many languages' data (Lample and Conneau, 2019). This enables the model to transfer some of what it learns from high-resource languages to low-resource ones, demonstrating benefits over monolingual methods in some cases (Conneau et al., 2020a; Tsai et al., 2019), though not always (Agerri et al., 2020; Rönnqvist et al., 2019).…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
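As a concrete, heavily simplified illustration of the fine-tuning setup this excerpt refers to, the sketch below loads the public multilingual BERT checkpoint with the Hugging Face transformers library and attaches a token-classification head. The label set, example sentence, and dummy labels are placeholders for illustration only, not the data or configuration used in the cited work.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]   # illustrative tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)
model.train()

# One toy step: any language seen in pretraining can be fed to the same model.
enc = tokenizer("Bonjour Montréal", return_tensors="pt")
dummy_labels = torch.zeros_like(enc["input_ids"])    # placeholder gold tags
out = model(**enc, labels=dummy_labels)
out.loss.backward()                                  # gradients for one update
```

Because the encoder is shared across all pretraining languages, supervision in high-resource languages can improve tagging in low-resource ones, which is the transfer effect the excerpt describes.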
“…Regarding multilingual transformers, Tsai et al. (2019) evaluated distilled versions of BERT and mBERT for POS tagging and Morphology tasks. Their version of mBERT is 6 times smaller and 27 times faster but induces an average F1 drop of 1.6% and 5.4% in the two evaluated tasks.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
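The size and speed figures quoted here (6x smaller, 27x faster, runnable on a single CPU) are the kind of numbers obtained by counting parameters and timing single-threaded inference. The sketch below shows one rough way to take such measurements in PyTorch; the stand-in encoder, sequence length, and thread setting are arbitrary assumptions, not the authors' benchmark protocol.

```python
import time
import torch
import torch.nn as nn

# Stand-in student model: a small Transformer encoder, not the paper's architecture.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=3,
).eval()

num_params = sum(p.numel() for p in model.parameters())
x = torch.randn(1, 128, 256)          # one sequence of 128 token embeddings
torch.set_num_threads(1)              # mimic single-CPU inference
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(20):
        model(x)
    latency_ms = (time.perf_counter() - start) / 20 * 1000

print(f"{num_params / 1e6:.1f}M parameters, {latency_ms:.1f} ms per sequence")
```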