2005
DOI: 10.1186/1471-2105-6-s1-s3

GENETAG: a tagged corpus for gene/protein named entity recognition

Abstract: Background: Named entity recognition (NER) is an important first step for text mining the biomedical literature. Evaluating the performance of biomedical NER systems is impossible without a standardized test corpus. The annotation of such a corpus for gene/protein name NER is a difficult process due to the complexity of gene/protein names. We describe the construction and annotation of GENETAG, a corpus of 20K MEDLINE® sentences for gene/protein NER. 15K GENETAG sentences were used for the BioCreAtIvE Task 1A…
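The abstract's point that evaluation requires a standardized corpus can be made concrete with a span-level scorer. The sketch below is illustrative only, not the official BioCreAtIvE Task 1A evaluation (which also accepted alternative gold spans); the function name, offsets, and toy data are all hypothetical.

# Minimal sketch of exact-span NER evaluation (precision/recall/F1).
# Not the official BioCreAtIvE scorer; the spans below are hypothetical.

def span_f1(gold, predicted):
    """Score predicted entity spans against gold spans by exact match.

    Each span is a (sentence_id, start, end) tuple, where start/end are
    character offsets of a gene/protein mention.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # exact-match true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 4, 9), (0, 27, 35)}   # two gold gene mentions (toy data)
pred = {(0, 4, 9), (0, 40, 46)}   # one correct, one spurious
print(span_f1(gold, pred))        # (0.5, 0.5, 0.5)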

Cited by 165 publications (123 citation statements)
References 5 publications
“…Whenever new annotators joined the project, they had to be trained using previously annotated examples and to follow the guidelines. Colosimo et al [5] and Tanabe et al [28] also conduct corpus annotation in the biology domain and conclude that clear annotation guidelines are important, and that the annotations should be validated by proper inter-annotator-agreement experiments.…”
Section: How To Annotate Properly: What Have We Learnt?
confidence: 99%
“…Firstly, it is difficult to choose the right category (e.g., Ben Nevis can refer to a person or a mountain in the UK); secondly, it is difficult to select the candidate texts and delimitation boundaries (e.g., should we annotate proper nouns only, or also pronouns and definitional descriptions?); thirdly, it is difficult to decide how to annotate homonyms (e.g., "England" may refer to a location or a football team). These problems become even harder to resolve within specialised domains such as bioinformatics and engineering, due to the intrinsic complexity of terms in these domains, including multi-word expressions, complex noun phrase compositions, acronyms, ambiguities and so on [28]. Typically, the inter-annotator agreement in NER found in these domains is between 60% and 80% [29][5][21].…”
Section: Annotator Discrepancy
confidence: 99%
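The 60% to 80% agreement figures quoted above are typically reported either as raw percentage agreement or as a chance-corrected statistic. Below is a minimal, hypothetical sketch of token-level Cohen's kappa for two annotators' BIO tags; the tag sequences are toy data, not drawn from GENETAG.

# Hedged sketch: token-level inter-annotator agreement via Cohen's kappa.
# The two annotation sequences below are hypothetical toy data.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' token labels (e.g. BIO tags)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["B-GENE", "I-GENE", "O", "O", "B-GENE", "O"]
b = ["B-GENE", "O",      "O", "O", "B-GENE", "O"]
print(cohens_kappa(a, b))   # ≈ 0.70 with these toy labels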
“…Examples include GENIA (Kim et al, 2008), BioInfer (Pyysalo et al, 2007), GREC (Thompson et al, 2009), PennBioIE (Kulick et al, 2004), GENETAG (Tanabe et al, 2005) and LLL'05 (Hakenberg et al, 2005). However, none of these corpora is annotated with the types of entities and relationships that are relevant to the study of phenotype information.…”
Section: Related Work
confidence: 99%
“…We used two well-known corpora from the literature that have frequently been used as benchmarks in several papers and challenges: GeneTag from the Genia dataset by Tanabe et al (2005) and the BioCreative dataset from Yeh et al (2005) (the best F-scores for gene/protein name extraction on these corpora are 77.8% and 80%, respectively). Furthermore, we consider a very large corpus to fully benefit from the scalability of the proposed pattern mining techniques.…”
Section: Motivating Example
confidence: 99%