Proceedings of the 18th BioNLP Workshop and Shared Task 2019
DOI: 10.18653/v1/w19-5034

ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

Abstract: Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new Python library and models for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance…
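As a rough orientation to what the library described in the abstract provides, the following is a minimal sketch of loading one of the released scispaCy pipelines and inspecting the entities it detects. The model name en_core_sci_sm refers to the small scientific-text model distributed with scispaCy, and the example sentence is illustrative only.

import spacy

# Minimal sketch: load a scispaCy pipeline and run it over one sentence.
# Assumes the scispacy package and the en_core_sci_sm model are installed.
nlp = spacy.load("en_core_sci_sm")
doc = nlp("Spinal and bulbar muscular atrophy (SBMA) is an inherited motor neuron disease.")

# Entity mentions found by the biomedical NER component.
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char)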

Cited by 419 publications (304 citation statements). References 31 publications.
“…The average paper length is 154 sentences (2,769 tokens), resulting in a corpus size of 3.17B tokens, similar to the 3.3B tokens on which BERT was trained. We split sentences using ScispaCy (Neumann et al., 2019), which is optimized for scientific text.…”
Section: Corpus
confidence: 99%
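For context, sentence splitting of the kind described in this excerpt is typically done by running a scispaCy pipeline over the raw text and iterating over its sentence boundaries. The snippet below is a hedged sketch of that step; the model name en_core_sci_sm and the sample text are assumptions, not the cited authors' setup.

import spacy

nlp = spacy.load("en_core_sci_sm")
text = "BERT was trained on roughly 3.3B tokens. We segment papers into sentences before pretraining."
doc = nlp(text)

# doc.sents yields the sentence spans produced by the pipeline's parser.
sentences = [sent.text for sent in doc.sents]
print(len(sentences), sentences)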
“…Since our methods require annotations at the token level, we also preprocess the dataset with a tokenizer specialized for the medical domain. We use sciSpacy [7] for tokenization and sentence splitting, which is trained for the biomedical domain. Occasionally, sentence splitting errors resulted in misaligned mentions, which were therefore missing from training and evaluation.…”
Section: Data
confidence: 99%
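The misalignment issue mentioned in this excerpt can be illustrated with spaCy's char_span, which returns None when a mention's character offsets do not land on token boundaries. The sketch below (hypothetical text, offsets, and helper name, assuming the en_core_sci_sm model) shows how such mentions would be detected and dropped.

import spacy

nlp = spacy.load("en_core_sci_sm")
text = "Erlotinib inhibits EGFR. It is used in non-small cell lung cancer."
doc = nlp(text)

def mention_is_aligned(doc, start_char, end_char):
    # char_span returns None if the offsets do not match token boundaries,
    # which is how misaligned mentions end up excluded from training data.
    return doc.char_span(start_char, end_char) is not None

print(mention_is_aligned(doc, 19, 23))  # offsets of "EGFR" in the text above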
“…For example, "helper CD4+IL-17-IFN-γhi type 1 cells" is a synonymous reference to the example above, but it is extremely unlikely to match against a large list of aliases. For this reason, we sought to determine how well these expression strings are tokenized with common tokenization tools such as ScispaCy [5]. The string "CD4+IL-17-IFN-γhi" cannot be tokenized well without knowledge of protein boundaries, so we developed a method hereinafter referred to as "ptkn", which would partition such an example as [CD4+, IL-17-, IFN-γ+].…”
Section: Expression Signature Tokenization
confidence: 99%
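The excerpt's point is that splitting such strings correctly requires knowing where protein names begin and end; naive splitting on "+" or "-" would break names like "IL-17". The sketch below is an illustrative, dictionary-based partitioner in that spirit only: the protein list, the marker handling, and the function name are assumptions, not the cited "ptkn" implementation.

# Hypothetical protein vocabulary; a real system would use a curated list.
KNOWN_PROTEINS = ["IL-17", "IFN-γ", "CD4"]  # checked longest-first

def partition_expression(expr):
    tokens, i = [], 0
    while i < len(expr):
        for name in KNOWN_PROTEINS:
            if expr.startswith(name, i):
                # Consume the protein name plus its level marker (+, -, hi, lo).
                j = i + len(name)
                while j < len(expr) and expr[j] in "+-hilo":
                    j += 1
                tokens.append(expr[i:j])
                i = j
                break
        else:
            i += 1  # skip characters not attributable to a known protein
    return tokens

print(partition_expression("CD4+IL-17-IFN-γhi"))
# -> ['CD4+', 'IL-17-', 'IFN-γhi']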
“…Beyond common embedding/hidden layer sizes, dropout, learning rate, and pre-trained (using vectors from [7]) vs. de novo word embedding hyperparameters, featurization strategies and the presence of positional embeddings were also included in the search. Some of the different featurization strategies included adding special tokens around the entity spans in question for a candidate as in [5] (on the ChemProt RE task), using anonymized placemarks for the entities based on type (e.g. using "CYTOKINE" instead of "IL-2"), and using similar placemarks and/or special enclosing tokens for entities identified in the candidate sentence but not part of the relation in question.…”
Section: Training
confidence: 99%
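To make the two featurization strategies in this excerpt concrete, the sketch below shows wrapping a candidate entity span in special marker tokens and replacing an entity with a type-based placemark. The marker strings, function names, and example sentence are illustrative assumptions rather than the cited authors' code.

def add_entity_markers(tokens, span, left="[E1]", right="[/E1]"):
    # Wrap the (start, end) token span of a candidate entity in special tokens.
    start, end = span
    return tokens[:start] + [left] + tokens[start:end] + [right] + tokens[end:]

def anonymize_entity(tokens, span, entity_type):
    # Replace the entity's tokens with a single type-based placemark.
    start, end = span
    return tokens[:start] + [entity_type] + tokens[end:]

tokens = ["IL-2", "activates", "STAT5", "in", "T", "cells", "."]
print(add_entity_markers(tokens, (0, 1)))
# ['[E1]', 'IL-2', '[/E1]', 'activates', 'STAT5', 'in', 'T', 'cells', '.']
print(anonymize_entity(tokens, (0, 1), "CYTOKINE"))
# ['CYTOKINE', 'activates', 'STAT5', 'in', 'T', 'cells', '.']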