2006
DOI: 10.1186/1471-2105-7-s5-s11

NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition

Abstract: Background: Biomedical named entity recognition (Bio-NER) is a challenging problem because, in general, biomedical named entities of the same category (e.g., proteins and genes) do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. In recent years, machine-learning (ML) approaches have become increasingly common and now represent the cutting edge of Bio-NER technology. This paper addresses three problems faced by ML-based Bio-NER systems. First, most…

Cited by 104 publications (98 citation statements)
References 23 publications
“…While many systems use some form of stemming, BANNER instead employs lemmatization [16], which is similar in purpose except that words are converted into their base form instead of simply removing the suffix. Also notable is the numeric normalization feature [15], which replaces the digits in each token with a representative digit (e.g. "0").…”
Section: Architecture (mentioning)
confidence: 99%
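
The numeric normalization described in the statement above can be illustrated with a minimal Python sketch. The function name `normalize_digits` and the choice to map only ASCII digits are assumptions made for illustration; this is not code from BANNER or NERBio.

```python
import re

# Hypothetical helper: replace every ASCII digit in a token with one
# representative digit, so that tokens differing only in their numbers
# share a single normalized form.
_DIGIT = re.compile(r"[0-9]")


def normalize_digits(token: str, representative: str = "0") -> str:
    """Return the token with each digit replaced by `representative`."""
    return _DIGIT.sub(representative, token)


if __name__ == "__main__":
    for tok in ["IL-2", "IL-4", "p53", "GlcNAc6ST-1"]:
        print(tok, "->", normalize_digits(tok))
```

Under this sketch, "IL-2" and "IL-4" both map to "IL-0", which is the kind of collapsing of numeric variants that the statement describes.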
“…For example, the adjective "human" appears in at the beginning of 657 cell types, 213 proteins and 354 DNA entities, but was missed in other 96 equivalent cases: 31 for cell types, 29 for proteins and 25 for DNAs. A description of other inconsistences and annotation problems detected in the training set can be found in [47] and [45].…”
Section: Results on the JNLPBA'04 Challenge Dataset (mentioning)
confidence: 99%
“…We also apply suspending hyphen rules to expanded series such as "GlcNAc6ST-1, -2, and -3" to "GlcNAc6ST-1, GlcNAc6ST-2 and GlcNAc6ST-3". Next, our NERBio system [4] identifies all gene mentions in the given text. Finally, four post-processing rules are employed to identify more gene mentions.…”
Section: Gene Mention Recognition (mentioning)
confidence: 99%
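
The suspended-hyphen expansion mentioned in the statement above can be sketched in Python as follows. The regular expressions and the name `expand_suspended_hyphens` are illustrative assumptions, not the citing authors' rules. Unlike the quoted example, this sketch leaves the original separators (including a comma before "and") unchanged.

```python
import re

# Assumed token shapes: a stem and suffixes made of letters and digits,
# separated by commas and/or the conjunctions "and" / "or".
_SUFFIX = r"[A-Za-z0-9]+"
_SEP = r"(?:\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+)"

# A full term "stem-suffix" followed by one or more suspended "-suffix" items.
_SERIES = re.compile(rf"\b({_SUFFIX})-{_SUFFIX}(?:{_SEP}-{_SUFFIX})+")
# One suspended item together with the separator that precedes it.
_SUSPENDED = re.compile(rf"({_SEP})-({_SUFFIX})")


def expand_suspended_hyphens(text: str) -> str:
    """Prefix each suspended "-suffix" with the stem of the series head."""

    def _expand(match: re.Match) -> str:
        stem = match.group(1)
        return _SUSPENDED.sub(
            lambda s: f"{s.group(1)}{stem}-{s.group(2)}", match.group(0)
        )

    return _SERIES.sub(_expand, text)


if __name__ == "__main__":
    print(expand_suspended_hyphens("GlcNAc6ST-1, -2, and -3"))
```

Text without a suspended series, such as an ordinary hyphenated word, passes through unchanged.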