Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1371

SciBERT: A Pretrained Language Model for Scientific Text

Abstract: Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SCIBERT, a pretrained language model based on BERT (Devlin et al., 2019) to address the lack of high-quality, large-scale labeled scientific data. SCIBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing…
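The released model can be used directly to produce contextual embeddings for scientific text. The following is a minimal sketch, assuming the Hugging Face Transformers library and the publicly released allenai/scibert_scivocab_uncased checkpoint; it is an illustration, not code from the paper itself.

# Minimal sketch: encode a scientific sentence with SciBERT.
# Assumes the Hugging Face Transformers library and the released
# allenai/scibert_scivocab_uncased checkpoint (not taken from the paper).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

sentence = "The glomerular filtration rate was measured after drug administration."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)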

Cited by 1,794 publications (1,602 citation statements). References: 23 publications.
“…However, to address several limitations, we choose to train our own clinical BERT model in this work. First, existing models are initialized from BioBERT [39] or BERT BASE [16], though SciBERT [6] outperforms BioBERT on a number of downstream tasks. Secondly, existing models do not satisfactorily encode the personal health identifiers (PHI) within the notes (e.g., [**2126-9-19**]), either leaving them as is, or removing them altogether.…”
Section: Pretrained Clinical Embeddings
Citation type: mentioning
confidence: 99%
“…Initialization. Unlike previous approaches, we initialize our model from SciBERT, which has been shown to have better performance on a variety of benchmarking tasks [6].…”
Section: Baseline Clinical Bert Pretraining
Citation type: mentioning
confidence: 99%
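Initializing a domain-specific model from SciBERT, as described in the citation above, amounts to loading the released weights as the starting point for further masked-language-model pretraining. The sketch below illustrates this under stated assumptions: the clinical-notes file path and training settings are hypothetical placeholders, not the cited paper's actual configuration.

# Sketch: continue masked-LM pretraining from the SciBERT checkpoint.
# The corpus path and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased")

# Hypothetical plain-text clinical corpus, one note per line.
dataset = load_dataset("text", data_files={"train": "clinical_notes.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="clinical-scibert",
                         per_device_train_batch_size=16, num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()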
“…We also extend these efforts to incorporate PMC full text articles, which provide access to approximately 6 times as many relations per relevant document than PubMed abstracts (5.1 relations per full-text article vs .8 relations per abstract only article when pooling across all 3 relation types), as well as transcription factors after lending credibility to the shared relation types through comparison with those first published in iX. Finally, we compare a weakly supervised approach, using Snorkel [2], for relation extraction (RE) to one based on transfer learning, using SciBERT [3], in order to evaluate what advantages, if any, arise from the auxiliary engineering effort inherent to weak supervision. In summary, this study is intended to demonstrate the viability of weak supervision for biological relation extraction in scientific literature as well as share a large database of T cell-specific cytokine and transcription factor relationships.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
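The transfer-learning side of the comparison above typically reduces to fine-tuning SciBERT as a classifier over candidate relation mentions. The sketch below is a hypothetical illustration of that setup: the relation label set and the inline entity-marker scheme are assumptions for demonstration, not the cited study's exact protocol.

# Sketch: SciBERT as a relation classifier over marked entity pairs.
# Label set and entity markers are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["no_relation", "induces", "secreted_by"]  # hypothetical relation types

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=len(labels)
)

# Candidate sentence with the two entity mentions marked inline.
text = "[E1] IL-12 [/E1] drives Th1 differentiation via [E2] STAT4 [/E2]."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# The classification head is randomly initialized here; predictions are only
# meaningful after fine-tuning on labeled relation examples.
print(labels[logits.argmax(dim=-1).item()])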