2018
DOI: 10.1186/s13321-018-0314-7

Statistical principle-based approach for gene and protein related object recognition

Abstract: The large number of chemical and pharmaceutical patents has attracted researchers doing biomedical text mining to extract valuable information such as chemicals, genes and gene products. To facilitate gene and gene product annotations in patents, BioCreative V.5 organized a gene- and protein-related object (GPRO) recognition task, in which participants were assigned to identify GPRO mentions and determine whether they could be linked to their unique biological database records. In this paper, we describe the s…

citations
Cited by 7 publications
(5 citation statements)
references
References 14 publications
“…Furthermore, in the process of template matching, the scoring mechanism assigns scores according to the relations between matched/unmatched units, insertions/deletions, and the corresponding unit. We adopted logistic regression (LR) (Ng, 2004) to determine whether a template is activated by a sequence pattern since the LR‐based model is efficient in learning the weight of a template for the task of recognizing objects in a text sequence (Lai, Huang, Yang, Hsu, & Tsai, 2018; Lee & Liu, 2003; Liang & Forbus, 2015). The similarity of an alignment of template t and labeled pattern l is calculated by the following formula: $S(t, l) = \sum_i \lambda_M(\text{the } i\text{th matched word}) + \sum_j \lambda_D(\text{the } j\text{th deletion}) + \sum_k \lambda_I(\text{the } k\text{th inserted bigram})$, where $\lambda_M$ is the weight of the matched words, $\lambda_D$ is the weight of deleted units, and $\lambda_I$ is the weight of the bigram consisting of the insertion and its neighboring left (resp.…”
Section: Methods (mentioning)
confidence: 99%
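The scoring formula quoted above can be illustrated with a minimal sketch. It assumes each weight (λ_M, λ_D, λ_I) is a lookup table from a unit to a learned number, and that the logistic-regression step passes the summed score through a sigmoid to decide whether a template is activated. The function names, the dictionary representation, the bias term, and the sample weights are all hypothetical and are not taken from the cited papers.

```python
import math


def alignment_score(matched_words, deletions, inserted_bigrams,
                    w_match, w_delete, w_insert, default=0.0):
    """Sum the weights of matched words, deleted units and inserted bigrams.

    The three weight tables stand in for lambda_M, lambda_D and lambda_I.
    Units missing from a table contribute `default` (an assumption).
    """
    score = sum(w_match.get(w, default) for w in matched_words)
    score += sum(w_delete.get(u, default) for u in deletions)
    score += sum(w_insert.get(b, default) for b in inserted_bigrams)
    return score


def activation_probability(score, bias=0.0):
    """Logistic function of the score, written to avoid overflow.

    Treating the alignment score as the linear term of a logistic
    regression is an assumption made for this illustration.
    """
    z = score + bias
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
example_score = alignment_score(
    matched_words=["kinase"],
    deletions=["the"],
    inserted_bigrams=[("human", "kinase")],
    w_match={"kinase": 2.0},
    w_delete={"the": -0.5},
    w_insert={("human", "kinase"): 0.25},
)
is_activated = activation_probability(example_score) > 0.5
```

With these made-up weights the score is 2.0 - 0.5 + 0.25 = 1.75, so the template would count as activated under a 0.5 threshold; in practice the weights would be learned from labeled patterns.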
“…In the NER stage, BelSmile consists of an ensemble system composed of three approaches: statistical principle, conditional random fields (CRF) and dictionary-based. The statistical principle-based approach is used to identify protein mentions and achieved the highest score in terms of the second evaluation metric of the BioCreative V.5 Gene and protein related object recognition (GPRO) task (20). The CRF-based NERChem (21) is used to identify chemical mentions.…”
Section: Methods (mentioning)
confidence: 99%
“…For example, Po-Ting Lai et al. introduced the statistical principle-based approach (SPBA) for named entity recognition and participated in a BioCreative V.5 gene- and protein-related object (GPRO) task to evaluate the ability of SPBA for processing patent abstracts. In the BioCreative V.5 GPRO task, this approach achieved an F-score of 73.73% on GPRO type 1 and an F-score of 78.66% on combining GPRO type 1 and 2 [22]. However, most statistics-based named entity recognition models have high training time complexity, high training cost, and strong dependence on the quality of corpus.…”
Section: Related Work (mentioning)
confidence: 99%
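For readers unfamiliar with the metric behind the figures quoted above, the sketch below computes an F-score as the weighted harmonic mean of precision and recall from mention-level counts. The excerpt does not state which F-score variant the task used or give the underlying counts, so the balanced default (beta = 1) and the example counts are assumptions made purely for illustration.

```python
def f_score(true_positives, false_positives, false_negatives, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    predicted = true_positives + false_positives
    gold = true_positives + false_negatives
    # No correct predictions (or nothing to predict) gives an F-score of 0.
    if true_positives == 0 or predicted == 0 or gold == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / gold
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 3 correct mentions, 1 spurious, 1 missed.
example_f1 = f_score(true_positives=3, false_positives=1, false_negatives=1)
```

Here precision and recall are both 3/4, so the balanced F-score is also 0.75.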