GENETAG: a tagged corpus for gene/protein named entity recognition

Tanabe, Lorraine; Xie, Natalie; Thom, Lynne H.; Matten, Wayne T.; Wilbur, W. John

doi:10.1186/1471-2105-6-s1-s3

Cited by 165 publications

(123 citation statements)

References 5 publications

Supporting

Mentioning: 119

Contrasting

“…Whenever new annotators joined the project, they had to be trained using previously annotated examples and follow the guideline. Colosimo et al [5] and Tanabe et al [28] also conduct corpus annotation in the biology domain and conclude that clear annotation guidelines are important, and the annotations should be validated by proper interannotator-agreement experiments.…”

Section: How To Annotate Properly: What Have We Learnt? (mentioning)

confidence: 99%

“…Firstly, it is difficult to choose the right category (e.g., Ben Nevis can refer to person or a mountain in the UK); secondly, it is difficult to select the candidate texts and delimitation boundaries (e.g., should we annotate proper nouns only, or also pronouns and definitional descriptions); thirdly, how to annotate homonyms, e.g., "England" may refer to a location or a football team. These problems become even harder to resolve within specialised domains such as bioinformatics and engineering, due to the intrinsic complexity of terms in these domains including multi-word expressions, complex noun phrase compositions, acronyms, ambiguities and so on [28]. Typically, the inter-annotator agreement in NER found in these domains is between 60% and 80% [29][5] [21].…”

Section: Annotator Discrepancy (mentioning)

confidence: 99%

A Methodology towards Effective and Efficient Manual Document Annotation: Addressing Annotator Discrepancy and Annotation Quality

Zhang, Chapman, Ciravegna. 2010. Knowledge Engineering and Management by the Masses.

Abstract. Manual document annotation is an essential technique for knowledge acquisition and capture. Creating high-quality annotations is a difficult task due to inter-annotator discrepancy, the problem that annotators can never agree completely on what and exactly how to annotate. To address this, traditional document annotation involves multiple domain experts working on the same annotation task in an iterative and collaborative manner to identify and resolve discrepancies progressively. However, such a detailed process is often ineffective despite taking significant time and effort; unfortunately, discrepancies remain high in many cases. This paper proposes an alternative approach to document annotation. The approach tackles the problem by firstly studying annotators' suitability based on the types of information to be annotated; then identifying and isolating the most inconsistent annotators who tend to cause the majority of discrepancies in a task; finally distributing annotation workload among the most suitable annotators. Tested in a named entity annotation task in the domain of archaeology, we show that compared to the traditional approach to document annotation, it produces larger amounts of better quality annotations that result in higher machine learning accuracy while requires significantly less time and effort.

“…Examples include GENIA (Kim et al, 2008), BioInfer (Pyysalo et al, 2007) GREC ( Thompson et al, 2009), PennBioIE (Kulick et al, 2004), GENETAG (Tanabe et al, 2005) and LLL'05 (Hakenberg et al, 2005). However, none of these corpora is annotated with the types of entities and relationships that are relevant to the study of phenotype information.…”

Section: Related Work (mentioning)

confidence: 99%

Proceedings of the 5th International Workshop on Health Text Mining and Information Analysis (Louhi)

Alnazzawi. 2014.

Narrative information in Electronic Health Records (EHRs) and literature articles contains a wealth of clinical information about treatment, diagnosis, medication and family history. This often includes detailed phenotype information for specific diseases, which in turn can help to identify risk factors and thus determine the susceptibility of different patients. Such information can help to improve healthcare applications, including Clinical Decision Support Systems (CDS). Clinical text mining (TM) tools can provide efficient automated means to extract and integrate vital information hidden within the vast volumes of available text. Development or adaptation of TM tools is reliant on the availability of annotated training corpora, although few such corpora exist for the clinical domain. In response, we have created a new annotated corpus (PhenoCHF), focussing on the identification of phenotype information for a specific clinical sub-domain, i.e., congestive heart failure (CHF). The corpus is unique in this domain, in its integration of information from both EHRs (300 discharge summaries) and literature articles (5 full-text papers). The annotation scheme, whose design was guided by a domain expert, includes both entities and relations pertinent to CHF. Two further domain experts performed the annotation, resulting in high quality annotation, with agreement rates up to 0.92 F-Score.

“…We used two well-known corpora from the literature that have frequently been used as benchmark in several papers and challenges: GeneTag from Genia dataset by Tanabe et al (2005) and BioCreative dataset from Yeh et al (2005) (the best F-score for gene/protein name extraction on these corpora are respectively 77.8% and 80%). Furthermore, we consider a very large corpus to fully benefit from scalability of the proposed pattern mining techniques.…”

Section: Motivating Example (mentioning)

confidence: 99%

Combining sequence and itemset mining to discover named entities in biomedical texts: a new type of pattern

Plantevit, Charnois, Kléma et al. 2009. IJDMMM.

Biomedical named entity recognition (NER) is a challenging problem. In this paper, we show that mining techniques, such as sequential pattern mining and sequential rule mining, can be useful to tackle this problem but present some limitations. We demonstrate and analyse these limitations and introduce a new kind of pattern called LSR pattern that offers an excellent trade-off between the high precision of sequential rules and the high recall of sequential patterns. We formalise the LSR pattern mining problem first. Then we show how LSR patterns enable us to successfully tackle biomedical NER problems. We report experiments carried out on real datasets that underline the relevance of our proposition.
