2015
DOI: 10.1186/1758-2946-7-s1-s2

The CHEMDNER corpus of chemicals and drugs and its annotation principles

Abstract: The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 che…
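The corpus described in the abstract is distributed as plain-text annotation files. As a minimal sketch only (not the official tooling), the snippet below reads a tab-separated annotation file into simple records; the column layout assumed here (PubMed ID, title/abstract flag, start and end character offsets, mention text, entity class) is an assumption and should be verified against the corpus documentation.

```python
import csv
from collections import namedtuple

# Assumed record layout; check the CHEMDNER distribution's documentation for the actual columns.
Mention = namedtuple("Mention", "pmid section start end text label")

def load_annotations(path):
    """Read a CHEMDNER-style tab-separated annotation file into Mention records."""
    mentions = []
    with open(path, encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            pmid, section, start, end, text, label = row[:6]
            mentions.append(Mention(pmid, section, int(start), int(end), text, label))
    return mentions
```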

Cited by 213 publications (200 citation statements)
References 47 publications
“…For this reason, several efforts have been underway to provide standardized corpora for testing. For example, the CHEMDNER corpus contains 10 K abstracts that have been manually annotated with trivial chemical names (30.36%), systematic names (22.69%), chemical abbreviations (15.55%), chemical formulas (14.26%) and chemical families (14.15%), along with chemical identifiers (2.16%) and text that captures more than one type of chemical entity (0.70%) (Krallinger et al., 2015). Table 1 summarizes how well some systems identify the entities in the CHEMDNER corpus.…”
Section: Related Work (mentioning)
confidence: 99%
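To make the class breakdown quoted above concrete, here is a short sketch that tabulates the relative frequency of each entity class from records such as those produced by the loader sketched earlier. The label strings in the trailing comment are illustrative assumptions, not necessarily the exact class names used in the corpus.

```python
from collections import Counter

def label_distribution(mentions):
    """Percentage of each entity class among the annotated mentions."""
    if not mentions:
        return {}
    counts = Counter(m.label for m in mentions)
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 2) for label, n in counts.most_common()}

# Expected shape of the result (class names are illustrative):
# {'TRIVIAL': 30.36, 'SYSTEMATIC': 22.69, 'ABBREVIATION': 15.55, 'FORMULA': 14.26, ...}
```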
“…To facilitate the development of new and superior NER systems, BioCreative announced the CHEMDNER challenge, which ended in 2015 [1]. As part of this task, a team of experts has produced an extensive manually annotated corpus covering various chemical entity types, including systematic and trivial names, abbreviations and identifiers, formulae and phrases.…”
Section: Content Background (mentioning)
confidence: 99%
“…As part of this task, a team of experts has produced an extensive manually annotated corpus covering various chemical entity types, including systematic and trivial names, abbreviations and identifiers, formulae and phrases. Due to the many difficulties inherent in chemical entity detection and normalisation [1], even manual annotation yields an inter-annotator agreement of 91%, which can be regarded as the theoretical limit for any automatic system trained on this corpus. Twenty-six teams submitted NER systems to the challenge, the best of which reached F1 scores of ∼72−88% [2,3,4,5,6,7,8,9] on the two subtasks: chemical entity mention recognition (CEM) and chemical document indexing (CDI).…”
Section: Content Background (mentioning)
confidence: 99%
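The CEM subtask mentioned in this excerpt is conventionally scored by exact span matching against the gold annotations. The sketch below shows the micro-averaged precision/recall/F1 computation in that setting; it is not the official BioCreative evaluation script, which also covers ranked output and the CDI variant.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match scoring: a predicted mention counts as a true positive only if
    its (document id, start offset, end offset) triple appears in the gold set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Usage: both arguments are iterables of (pmid, start, end) tuples, one taken from the
# system output and one from the gold-standard corpus annotations.
```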
“…), recognition and processing of proper names including persons and places, acronyms, numbers, etc., verb conjugation detection and change, negation, singular/plural detection and conversion, stemming and normalization procedures, etc. Both come with prebuilt lexicons, i.e., collections of words "understood" by the mining algorithm, which are rather generic but can be extended as required for specific applications by using lexicons built for specific disciplines [58][59][60]. For applications in chemistry and biology, lexicons could in turn be extended by using tools for conversion between molecular formulas, structures and names, annotations and ontologies from databases, possibly even on-the-fly through JavaScript API calls to external web services or by mining knowledge databases such as DBpedia (the structured-data mirror of Wikipedia) or Wordnik (a meta-dictionary).…”
Section: JavaScript Tools For Handling Strings, Text Mining and Ling… (mentioning)
confidence: 99%
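The lexicon-based tagging that this excerpt describes can be illustrated with a deliberately simple dictionary matcher. The cited tools are JavaScript libraries, but the idea is language-agnostic; the sketch below uses Python, with a hand-made toy lexicon, and omits the tokenisation, stemming and normalisation steps mentioned above.

```python
import re

def tag_with_lexicon(text, lexicon):
    """Return (start, end, term) spans for case-insensitive lexicon matches,
    preferring longer terms so that e.g. 'acetylsalicylic acid' wins over 'acid'."""
    pattern = "|".join(re.escape(term) for term in sorted(lexicon, key=len, reverse=True))
    return [(m.start(), m.end(), m.group(0))
            for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

# Toy example with a tiny hand-made chemical lexicon:
print(tag_with_lexicon("Aspirin (acetylsalicylic acid) inhibits COX-1.",
                       ["aspirin", "acetylsalicylic acid", "acid"]))
# [(0, 7, 'Aspirin'), (9, 29, 'acetylsalicylic acid')]
```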