ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition

Akkasi, Abbas; Varoğlu, Ekrem; Dimililer, Nazife

doi:10.1155/2016/4248026

Cited by 22 publications

(16 citation statements)

References 18 publications

(29 reference statements)


“…The effect of tokenization on NER performance has been shown in the past (Akkasi et al, 2016; Xu et al, 2018). For this reason, we evaluated our model trained on the original training data, using various custom tokenization functions, and saw the strict micro-F1 score vary from 72% to 77% in the validation set.…”

Section: Effects Of Tokenization (mentioning)

confidence: 94%

University of Arizona at SemEval-2019 Task 12: Deep-Affix Named Entity Recognition of Geolocation Entities

Yadav, Laparra, Wang, et al. 2019

Proceedings of the 13th International Workshop on Semantic Evaluation


We present the Named Entity Recognition (NER) and disambiguation model used by the University of Arizona team (UArizona) for SemEval 2019 task 12. We achieved fourth place on tasks 1 and 3. We implemented a deep-affix based LSTM-CRF NER model for task 1, which utilizes only character, word, prefix and suffix information for the identification of geolocation entities. Despite using just the training data provided by task organizers and not using any lexicon features, we achieved 78.85% strict micro F-score on task 1. We used the unsupervised population heuristics for task 3 and achieved 52.99% strict micro-F1 score in this task.
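The citing statement and abstract above both report a strict micro-F1 score. As a rough illustration of why tokenization can move that number, here is a minimal sketch of strict (exact-boundary) micro-F1 over character spans. The span representation, the function name `strict_micro_f1`, and the toy data are assumptions made for this example; they are not taken from the SemEval-2019 Task 12 scorer.

```python
from typing import List, Set, Tuple

# Hypothetical span format for this sketch: (start_char, end_char, label).
Span = Tuple[int, int, str]


def strict_micro_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged F1 where a prediction counts only on an exact span match."""
    # True positives: spans present in both gold and predicted sets, summed over documents.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy data: the second document has one prediction whose right boundary is off
# by one character, as can happen when a tokenizer splits a token differently.
gold = [{(0, 6, "LOC")}, {(10, 18, "LOC"), (25, 30, "LOC")}]
pred = [{(0, 6, "LOC")}, {(10, 17, "LOC"), (25, 30, "LOC")}]
score = strict_micro_f1(gold, pred)
```

Under strict matching the off-by-one span earns no credit, so a boundary shift introduced by the tokenizer costs both precision and recall.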



“…To investigate the effect of a general domain tokenizer, following Habibi et al (2017), we also use the OpenNLP tokenizer. To investigate whether NER performance will be affected by tokenization quality, we employ three tokenizers optimized for chemical texts including ChemTok (Akkasi et al, 2016), OSCAR4 (Jessop et al, 2011) and NBIC UMLSGeneChemTokenizer. 1…”

Section: Tokenizers (mentioning)

confidence: 99%

Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings

Zhai, Nguyen, Akhondi, et al. 2019

Proceedings of the 18th BioNLP Workshop and Shared Task


Chemical patents are an important resource for chemical information. However, few chemical Named Entity Recognition (NER) systems have been evaluated on patent documents, due in part to their structural and linguistic complexity. In this paper, we explore the NER performance of a BiLSTM-CRF model utilising pre-trained word embeddings, character-level word representations and contextualized ELMo word representations for chemical patents. We compare word embeddings pre-trained on biomedical and chemical patent corpora. The effect of tokenizers optimized for the chemical domain on NER performance in chemical patents is also explored. The results on two patent corpora show that contextualized word representations generated from ELMo substantially improve chemical NER performance w.r.t. the current state-of-the-art. We also show that domain-specific resources such as word embeddings trained on chemical patents and chemical-specific tokenizers have a positive impact on NER performance.
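To make the contrast between a general-domain tokenizer and a chemistry-oriented one concrete, the following is a minimal sketch. Both functions are hypothetical stand-ins written for this illustration: `general_tokenize` is a simple word/punctuation splitter, and `chem_aware_tokenize` merely keeps hyphens and digits inside a token. Neither reproduces the actual rules of ChemTok, OSCAR4, OpenNLP, or the NBIC tokenizer.

```python
import re
from typing import List


def general_tokenize(text: str) -> List[str]:
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def chem_aware_tokenize(text: str) -> List[str]:
    """Split on whitespace, detaching only trailing sentence punctuation.

    Hyphens, digits, and brackets inside a chunk stay attached, so a
    systematic chemical name survives as one token. This is an assumed
    simplification, not any published tokenizer's rule set.
    """
    tokens: List[str] = []
    for chunk in text.split():
        core = chunk.rstrip(".,;:")
        tail = chunk[len(core):]
        if core:
            tokens.append(core)
        # Each trailing punctuation character becomes its own token.
        tokens.extend(tail)
    return tokens


sentence = "Add 2-chloro-4-nitrophenol, then stir."
general_tokens = general_tokenize(sentence)
chem_tokens = chem_aware_tokenize(sentence)
```

Here the general splitter breaks the compound name into seven pieces, while the chemistry-oriented sketch keeps it as a single token, which is the kind of difference a downstream sequence labeller has to absorb.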


“…For example, Riaz [11] developed an Urdu rule-based NER system that designed Urdu language pattern rules, such as the honorific title for person entities, suffix rules for location entities, and so on. Akkasi et al [12] constructed chemical-specific affixes (e.g., Hyper, Anti, and Amino) to detect the beginnings of mentions and used merged rules to detect the endings of mentions. Salah and Zakaria [20] summarized Arabic rule-based NER systems for Arabic language writing patterns, including grammar rules, heuristics rules, and morphological rules.…”

Section: A Entity Discovery (mentioning)

confidence: 99%

“…The rules are domain-independent because they are generated from the structural information of question representations. But common rule-based ED methods [11], [12] rely on the characteristics of entity types (e.g., persons, organizations, and locations) to build rules. The mention generation module also integrates the extracted mentions into an ED model to alleviate the insufficiency of annotated datasets.…”

Section: Introduction (mentioning)

confidence: 99%

Progressive Joint Framework for Chinese Question Entity Discovery and Linking With Question Representations

Lin, Zhang, et al. 2019

IEEE Access


Chinese question entity discovery and linking (QEDL) may encounter short texts and small-scale annotated datasets, which may invalidate certain machine learning algorithms. In this paper, we propose a progressive joint framework for Chinese QEDL, which leverages the mutual dependency information of these two tasks to enhance the performance of each other. The framework uses the candidate entity generation (CEG) of entity linking to iteratively augment the overall process of entity discovery that consists of mention generation, filtering and merging modules. In the mention generation module, to reduce the handcrafted effort of the rule-based entity discovery, we develop a question representation method to generate domain-independent entity discovery rules, and use CEG to check the extracted mentions in priority order. This module can embed extracted mentions into other entity discovery methods as one feature or as extra mentions to alleviate insufficiencies of annotated datasets. The mentions filtering module leverages the joint features of extracted mentions and CEG's entities to build a voting model and filter out low-confidence mentions. Moreover, the mentions merging module merges different patterns' mention-entity pairs and checks their corresponding candidate entities with CEG. During entity linking, we incorporate the joint features of questions, extracted mentions and CEG's entities into a ranking model for entity disambiguation. Finally, we conduct experiments on two real datasets and compare our approach with other state-of-the-art methods. The results illustrate that the proposed framework can reduce error accumulation and flexibly combine different entity discovery methods, which significantly improves the performance on small-scale datasets.

Index terms: Entity discovery and linking, information extraction, joint method, natural language processing, question representation model.
