Proceedings of the 18th BioNLP Workshop and Shared Task 2019
DOI: 10.18653/v1/w19-5034

ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

Abstract: Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance…
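As a rough orientation to what the library described in the abstract provides, the following is a minimal sketch of loading one of the released scispaCy pipelines and inspecting the entities it detects. The model name en_core_sci_sm refers to the small scientific-text model distributed with scispaCy, and the example sentence is illustrative only.

import spacy

# Minimal sketch: load a scispaCy pipeline and run it over one sentence.
# Assumes the scispacy package and the en_core_sci_sm model are installed.
nlp = spacy.load("en_core_sci_sm")
doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease.")

# Entity mentions found by the biomedical NER component.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char)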

Cited by 419 publications (304 citation statements). References 31 publications.
“…The average paper length is 154 sentences (2,769 tokens), resulting in a corpus size of 3.17B tokens, similar to the 3.3B tokens on which BERT was trained. We split sentences using ScispaCy (Neumann et al., 2019), which is optimized for scientific text.…”
Section: Corpus
confidence: 99%
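For context, sentence splitting of the kind described in this excerpt is typically done by running a scispaCy pipeline over the raw text and iterating over its sentence boundaries. The snippet below is a hedged sketch of that step; the model name en_core_sci_sm and the sample text are assumptions, not the cited authors' setup.

import spacy

nlp = spacy.load("en_core_sci_sm")
text = "BERT was trained on roughly 3.3B tokens. We segment papers into sentences before pretraining."
doc = nlp(text)

# doc.sents yields the sentence spans produced by the pipeline's parser.
sentences = [sent.text for sent in doc.sents]
print(len(sentences), sentences)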
“…Since our methods require annotations at the token level, we also preprocess the dataset with a tokenizer specialized for the medical domain. We use sciSpacy [7] for tokenization and sentence splitting, which is trained for the biomedical domain. Occasionally, sentence splitting errors resulted in misaligned mentions, which were therefore missing from training and evaluation.…”
Section: Data
confidence: 99%
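The misalignment issue mentioned in this excerpt can be illustrated with spaCy's char_span, which returns None when a mention's character offsets do not land on token boundaries. The sketch below (hypothetical text, offsets, and helper name, assuming the en_core_sci_sm model) shows how such mentions would be detected and dropped.

import spacy

nlp = spacy.load("en_core_sci_sm")
text = "Erlotinib inhibits EGFR. It is used in non-small cell lung cancer."
doc = nlp(text)

def mention_is_aligned(doc, start_char, end_char):
    # char_span returns None if the offsets do not match token boundaries,
    # which is how misaligned mentions end up excluded from training data.
    return doc.char_span(start_char, end_char) is not None

print(mention_is_aligned(doc, 19, 23))  # offsets of "EGFR" in the text above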
“…For example, "helper CD4+IL-17-IFN-γhi type 1 cells" is a synonymous reference to the example above, but it is extremely unlikely to match against a large list of aliases. For this reason, we sought to determine how well these expression strings are tokenized with common tokenization tools such as ScispaCy [5]. The string "CD4+IL-17-IFN-γhi" cannot be tokenized well without knowledge of protein boundaries, so we developed a method hereinafter referred to as "ptkn", which would partition such an example as [CD4+, IL-17-, IFN-γ+].…”
Section: Expression Signature Tokenization
confidence: 99%
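The excerpt's point is that splitting such strings correctly requires knowing where protein names begin and end; naive splitting on "+" or "-" would break names like "IL-17". The sketch below is an illustrative, dictionary-based partitioner in that spirit only: the protein list, the marker handling, and the function name are assumptions, not the cited "ptkn" implementation.

# Hypothetical protein vocabulary; a real system would use a curated list.
KNOWN_PROTEINS = ["IL-17", "IFN-γ", "CD4"]  # checked longest-first

def partition_expression(expr):
    tokens, i = [], 0
    while i < len(expr):
        for name in KNOWN_PROTEINS:
            if expr.startswith(name, i):
                # Consume the protein name plus its level marker (+, -, hi, lo).
                j = i + len(name)
                while j < len(expr) and expr[j] in "+-hilo":
                    j += 1
                tokens.append(expr[i:j])
                i = j
                break
        else:
            i += 1  # skip characters not attributable to a known protein
    return tokens

print(partition_expression("CD4+IL-17-IFN-γhi"))
# -> ['CD4+', 'IL-17-', 'IFN-γhi']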
“…Beyond common embedding/hidden layer sizes, dropout, learning rate, and pre-trained (using vectors from [7]) vs. de novo word embedding hyperparameters, featurization strategies and the presence of positional embeddings were also included in the search. Some of the different featurization strategies included adding special tokens around the entity spans in question for a candidate as in [5] (on the ChemProt RE task), using anonymized placemarks for the entities based on type (e.g. using "CYTOKINE" instead of "IL-2"), and using similar placemarks and/or special enclosing tokens for entities identified in the candidate sentence but not part of the relation in question.…”
Section: Training
confidence: 99%
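To make the two featurization strategies in this excerpt concrete, the sketch below shows wrapping a candidate entity span in special marker tokens and replacing an entity with a type-based placemark. The marker strings, function names, and example sentence are illustrative assumptions rather than the cited authors' code.

def add_entity_markers(tokens, span, left="[E1]", right="[/E1]"):
    # Wrap the (start, end) token span of a candidate entity in special tokens.
    start, end = span
    return tokens[:start] + [left] + tokens[start:end] + [right] + tokens[end:]

def anonymize_entity(tokens, span, entity_type):
    # Replace the entity's tokens with a single type-based placemark.
    start, end = span
    return tokens[:start] + [entity_type] + tokens[end:]

tokens = ["IL-2", "activates", "STAT5", "in", "T", "cells", "."]
print(add_entity_markers(tokens, (0, 1)))
# ['[E1]', 'IL-2', '[/E1]', 'activates', 'STAT5', 'in', 'T', 'cells', '.']
print(anonymize_entity(tokens, (0, 1), "CYTOKINE"))
# ['CYTOKINE', 'activates', 'STAT5', 'in', 'T', 'cells', '.']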