NERChem: adapting NERBio to chemical patents via full-token features and named entity feature with chemical sub-class composition

Tsai, Richard Tzong-Han; Hsiao, Yu Cheng; Lai, Po Ting

doi:10.1093/database/baw135

Cited by 4 publications

(5 citation statements)

References 16 publications

“…Refer to the practice of Tsai et al [31], we employ the GENIA Tagger [32] to process input documents, including tokenization, POS tagging and chunking. All of these provide features for our BiLSTM-CRF model to further enrich the information of each word.…”

Section: Feature Extraction (mentioning)

confidence: 99%

Knowledge-enhanced biomedical named entity recognition and normalization: application to proteins and genes

et al. 2020

Background: Automated biomedical named entity recognition and normalization serves as the basis for many downstream applications in information management. However, this task is challenging due to name variations and entity ambiguity. A biomedical entity may have multiple variants and a variant could denote several different entity identifiers. Results: To remedy the above issues, we present a novel knowledge-enhanced system for protein/gene named entity recognition (PNER) and normalization (PNEN). On one hand, a large amount of entity name knowledge extracted from biomedical knowledge bases is used to recognize more entity variants. On the other hand, structural knowledge of entities is extracted and encoded as identifier (ID) embeddings, which are then used for better entity normalization. Moreover, deep contextualized word representations generated by pre-trained language models are also incorporated into our knowledge-enhanced system for modeling multi-sense information of entities. Experimental results on the BioCreative VI Bio-ID corpus show that our proposed knowledge-enhanced system achieves 0.871 F1-score for PNER and 0.445 F1-score for PNEN, respectively, leading to a new state-of-the-art performance. Conclusions: We propose a knowledge-enhanced system that combines both entity knowledge and deep contextualized word representations. Comparison results show that entity knowledge is beneficial to the PNER and PNEN task and can be well combined with contextualized information in our system for further improvement.

Section: Feature Extraction (mentioning)

confidence: 99%

Knowledge-enhanced biomedical named entity recognition and normalization: application to proteins and genes

et al. 2020

“…The statistical principle-based approach is used to identify protein mentions and achieved the highest score in terms of the second evaluation metric of the BioCreative V.5 Gene and protein related object recognition (GPRO) task (20). The CRF-based NERChem (21) is used to identify chemical mentions. Finally, the dictionary-based approach is used to recognize disease and biological process mentions by using external dictionaries including Entrez, ChEBI and BEL official dictionaries, which are also used to normalize each recognized NE mention to its database identifier.…”

Section: Methods (mentioning)

confidence: 99%

The extraction of complex relationships and their conversion to biological expression language (BEL) overview of the BioCreative VI (2017) BEL track

et al. 2019

Self Cite

Knowledge of the molecular interactions of biological and chemical entities and their involvement in biological processes or clinical phenotypes is important for data interpretation. Unfortunately, this knowledge is mostly embedded in the literature in such a way that it is unavailable for automated data analysis procedures. Biological expression language (BEL) is a syntax representation allowing for the structured representation of a broad range of biological relationships. It is used in various situations to extract such knowledge and transform it into BEL networks. To support the tedious and time-intensive extraction work of curators with automated methods, we developed the BEL track within the framework of BioCreative Challenges. Within the BEL track, we provide training data and an evaluation environment to encourage the text mining community to tackle the automatic extraction of complex BEL relationships. In 2017 BioCreative VI, the 2015 BEL track was repeated with new test data. Although only minor improvements in text snippet retrieval for given statements were achieved during this second BEL task iteration, a significant increase of BEL statement extraction performance from provided sentences could be seen. The best performing system reached a 32% F-score for the extraction of complete BEL statements and with the given named entities this increased to 49%. This time, besides rule-based systems, new methods involving hierarchical sequence labeling and neural networks were applied for BEL statement extraction.

“…Thus, we find that GPRO mentions were usually substrings of SPBA’s NEs. To identify GPRO mentions, we employ our previous chemical name recognizer, NERChem [17], which bases on the CRF model. Firstly, we employ the GENIATagger [18] to segment every sentence into a sequence of tokens.…”

Section: Methods (mentioning)

confidence: 99%

“…Firstly, we employ the GENIATagger [18] to segment every sentence into a sequence of tokens. Then, we run a sub-tokenization module used in our previous work [17] to further segment tokens into sub-tokens. We use the SOBIE tag-scheme which has nine labels including B-GPRO_TYPE_1, I-GPRO_TYPE_1, E-GPRO_TYPE_1, S-GPRO_TYPE_1, B-GPRO_TYPE_2, I-GPRO_TYPE_2, E-GPRO_TYPE_2, and S-GPRO_TYPE_2, and O.…”

Section: Methods (mentioning)

confidence: 99%

Statistical principle-based approach for gene and protein related object recognition

et al. 2018

Self Cite

The large number of chemical and pharmaceutical patents has attracted researchers doing biomedical text mining to extract valuable information such as chemicals, genes and gene products. To facilitate gene and gene product annotations in patents, BioCreative V.5 organized a gene- and protein-related object (GPRO) recognition task, in which participants were assigned to identify GPRO mentions and determine whether they could be linked to their unique biological database records. In this paper, we describe the system constructed for this task. Our system is based on two different NER approaches: the statistical-principle-based approach (SPBA) and conditional random fields (CRF). Therefore, we call our system SPBA-CRF. SPBA is an interpretable machine-learning framework for gene mention recognition. The predictions of SPBA are used as features for our CRF-based GPRO recognizer. The recognizer was developed for identifying chemical mentions in patents, and we adapted it for GPRO recognition. In the BioCreative V.5 GPRO recognition task, SPBA-CRF obtained an F-score of 73.73% on the evaluation metric of GPRO type 1 and an F-score of 78.66% on the evaluation metric of combining GPRO types 1 and 2. Our results show that SPBA trained on an external NER dataset can perform reasonably well on the partial match evaluation metric. Furthermore, SPBA can significantly improve performance of the CRF-based recognizer trained on the GPRO dataset.
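The SOBIE tag scheme quoted in the citation statement above (B, I, E and S labels for each of the two GPRO types, plus O, giving nine labels) can be illustrated with a small sketch. This is not the authors' code: the function name `spans_to_sobie`, the span representation (token start index, exclusive end index, entity type) and the example sentence are assumptions made for illustration only.

```python
from typing import List, Tuple

# The nine labels named in the quoted statement: B/I/E/S for two GPRO types, plus O.
LABELS = [f"{prefix}-GPRO_TYPE_{t}" for t in (1, 2) for prefix in "BIES"] + ["O"]


def spans_to_sobie(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans into per-token SOBIE tags.

    Each span is (start, end, entity_type) with an exclusive end index.
    This span format is a hypothetical convention, not taken from the paper.
    """
    tags = ["O"] * n_tokens
    for start, end, entity_type in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"span out of range: {(start, end)}")
        if end - start == 1:
            # A single-token mention gets the S (single) tag.
            tags[start] = f"S-{entity_type}"
        else:
            # A multi-token mention: B (begin), I (inside)..., E (end).
            tags[start] = f"B-{entity_type}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{entity_type}"
            tags[end - 1] = f"E-{entity_type}"
    return tags


if __name__ == "__main__":
    # Made-up token sequence; the spans below are illustrative, not gold annotations.
    tokens = ["the", "epidermal", "growth", "factor", "binds", "EGFR"]
    spans = [(1, 4, "GPRO_TYPE_1"), (5, 6, "GPRO_TYPE_2")]
    for token, tag in zip(tokens, spans_to_sobie(len(tokens), spans)):
        print(f"{token}\t{tag}")
```

Compared with a plain BIO scheme, the explicit E and S tags give a sequence labeler such as a CRF direct evidence about where a mention ends and whether it consists of a single token.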
