2015
DOI: 10.1186/1758-2946-7-s1-s2

The CHEMDNER corpus of chemicals and drugs and its annotation principles

Abstract: The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 che…
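The corpus described in the abstract is distributed as plain-text annotation files. As a minimal sketch only (not the official tooling), the snippet below reads a tab-separated annotation file into simple records; the column layout assumed here (PubMed ID, title/abstract flag, start and end character offsets, mention text, entity class) is an assumption and should be verified against the corpus documentation.

```python
import csv
from collections import namedtuple

# Assumed record layout; check the CHEMDNER distribution's documentation for the actual columns.
Mention = namedtuple("Mention", "pmid section start end text label")

def load_annotations(path):
    """Read a CHEMDNER-style tab-separated annotation file into Mention records."""
    mentions = []
    with open(path, encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            pmid, section, start, end, text, label = row[:6]
            mentions.append(Mention(pmid, section, int(start), int(end), text, label))
    return mentions
```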

Cited by 213 publications (200 citation statements)
References 47 publications
“…For this reason, several efforts have been underway to provide standardized corpora for testing. For example, the CHEMDNER corpus contains 10 K abstracts that have been manually annotated with trivial chemical names (30.36%), systematic names (22.69%), chemical abbreviations (15.55%), chemical formulas (14.26%) and chemical families (14.15%), along with chemical identifiers (2.16%) and text that captures more than one type of chemical entity (0.70%) (Krallinger et al., 2015). Table 1 summarizes how well some systems identify the entities in the CHEMDNER corpus.…”
Section: Related Work (mentioning)
confidence: 99%
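To make the class breakdown quoted above concrete, here is a short sketch that tabulates the relative frequency of each entity class from records such as those produced by the loader sketched earlier. The label strings in the trailing comment are illustrative assumptions, not necessarily the exact class names used in the corpus.

```python
from collections import Counter

def label_distribution(mentions):
    """Percentage of each entity class among the annotated mentions."""
    if not mentions:
        return {}
    counts = Counter(m.label for m in mentions)
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 2) for label, n in counts.most_common()}

# Expected shape of the result (class names are illustrative):
# {'TRIVIAL': 30.36, 'SYSTEMATIC': 22.69, 'ABBREVIATION': 15.55, 'FORMULA': 14.26, ...}
```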
“…To facilitate the development of new and superior NER systems, BioCreative announced the CHEMDNER challenge, which ended in 2015 [1]. As part of this task, a team of experts has produced an extensive manually annotated corpus covering various chemical entity types, including systematic and trivial names, abbreviations and identifiers, formulae and phrases.…”
Section: Content Background (mentioning)
confidence: 99%
“…As part of this task, a team of experts has produced an extensive manually annotated corpus covering various chemical entity types, including systematic and trivial names, abbreviations and identifiers, formulae and phrases. Due to the many difficulties inherent in chemical entity detection and normalisation [1], even manual annotation yields an inter-annotator agreement of 91%, which can be regarded as the theoretical limit for any automatic system trained on this corpus. Twenty-six teams submitted NER systems to the challenge, the best of which reached F1 scores of ∼72−88% [2,3,4,5,6,7,8,9] on the two subtasks: chemical entity mention recognition (CEM) and chemical document indexing (CDI).…”
Section: Content Background (mentioning)
confidence: 99%
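The CEM subtask mentioned in this excerpt is conventionally scored by exact span matching against the gold annotations. The sketch below shows the micro-averaged precision/recall/F1 computation in that setting; it is not the official BioCreative evaluation script, which also covers ranked output and the CDI variant.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match scoring: a predicted mention counts as a true positive only if
    its (document id, start offset, end offset) triple appears in the gold set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Usage: both arguments are iterables of (pmid, start, end) tuples, one taken from the
# system output and one from the gold-standard corpus annotations.
```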
“…), recognition and processing of proper names including persons and places, acronyms, numbers, etc., verb conjugation detection and change, negation, singular/plural detection and conversion, stemming and normalization procedures, etc. Both come with prebuilt lexicons, i.e., collections of words "understood" by the mining algorithm, which are rather generic but can be extended as required for specific applications by using lexicons built for specific disciplines [58][59][60]. For applications in chemistry and biology, lexicons could in turn be extended by using tools for conversion between molecular formulas, structures and names, annotations and ontologies from databases, possibly even on-the-fly through JavaScript API calls to external web services or by mining knowledge databases such as DBpedia (the structured-data mirror of Wikipedia) or Wordnik (a meta-dictionary).…”
Section: JavaScript Tools For Handling Strings, Text Mining and Ling… (mentioning)
confidence: 99%
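The lexicon-based tagging that this excerpt describes can be illustrated with a deliberately simple dictionary matcher. The cited tools are JavaScript libraries, but the idea is language-agnostic; the sketch below uses Python, with a hand-made toy lexicon, and omits the tokenisation, stemming and normalisation steps mentioned above.

```python
import re

def tag_with_lexicon(text, lexicon):
    """Return (start, end, term) spans for case-insensitive lexicon matches,
    preferring longer terms so that e.g. 'acetylsalicylic acid' wins over 'acid'."""
    pattern = "|".join(re.escape(term) for term in sorted(lexicon, key=len, reverse=True))
    return [(m.start(), m.end(), m.group(0))
            for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

# Toy example with a tiny hand-made chemical lexicon:
print(tag_with_lexicon("Aspirin (acetylsalicylic acid) inhibits COX-1.",
                       ["aspirin", "acetylsalicylic acid", "acid"]))
# [(0, 7, 'Aspirin'), (9, 29, 'acetylsalicylic acid')]
```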