BioCreAtIvE Task 1A: gene mention finding evaluation

Yeh, Alexander; Morgan, Alexander A.; Colosimo, Marc E.; Hirschman, Lynette

doi:10.1186/1471-2105-6-s1-s2

Cited by 148 publications

(124 citation statements)

References 13 publications

“…NLPBA (Kim et al, 2004) is a large collection of biomedical abstracts annotated with five entities of interest, such as protein, RNA, and cell-type. BioCreative (Yeh et al, 2005) and FlySlip (Vlachos, 2007) also comprise texts in the biomedical domain, annotated for gene entity mentions in articles from the human and fruit fly literature, respectively. CORA (Peng and McCallum, 2004) consists of two collections: a set of research paper headers annotated for entities such as title, author, and institution; and a collection of references annotated with BibTeX fields such as journal, year, and publisher.…”

Section: Methods (mentioning)

confidence: 99%

An analysis of active learning strategies for sequence labeling tasks

Settles

Craven

2008

Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP '08

Active learning is well-suited to many problems in natural language processing, where unlabeled data may be abundant but annotation is slow and expensive. This paper aims to shed light on the best active learning approaches for sequence labeling tasks such as information extraction and document segmentation. We survey previously used query selection strategies for sequence models, and propose several novel algorithms to address their shortcomings. We also conduct a large-scale empirical comparison using multiple corpora, which demonstrates that our proposed methods advance the state of the art.

“…We report experiments performed on real datasets described in Section 2: BioCreative (Yeh et al, 2005, cf. Figure 5), Genia (Tanabe et al, 2005, cf.…”

Section: Methods (mentioning)

confidence: 99%

“…We used two well-known corpora from the literature that have frequently been used as benchmark in several papers and challenges: GeneTag from Genia dataset by Tanabe et al (2005) and BioCreative dataset from Yeh et al (2005) (the best F-score for gene/protein name extraction on these corpora are respectively 77.8% and 80%). Furthermore, we consider a very large corpus to fully benefit from scalability of the proposed pattern mining techniques.…”

Section: Motivating Example (mentioning)

confidence: 99%

Combining sequence and itemset mining to discover named entities in biomedical texts: a new type of pattern

Plantevit

Charnois

Kléma

et al. 2009

IJDMMM

Biomedical named entity recognition (NER) is a challenging problem. In this paper, we show that mining techniques, such as sequential pattern mining and sequential rule mining, can be useful to tackle this problem but present some limitations. We demonstrate and analyse these limitations and introduce a new kind of pattern called LSR pattern that offers an excellent trade-off between the high precision of sequential rules and the high recall of sequential patterns. We formalise the LSR pattern mining problem first. Then we show how LSR patterns enable us to successfully tackle biomedical NER problems. We report experiments carried out on real datasets that underline the relevance of our proposition.

“…protein interactions), the automatic classification of texts, and the generation of new hypotheses on the basis of the available literature [3]. The BioCreAtIvE contest [21] nicely shows, that even sophisticated tools for text mining have a considerable lack of precision and recall: For a simple "named entity recognition"-task the precision ranged up to 86% and the recall was at most 84%. Another attempt is described in [4]: Information about protein-interactions was extracted from a data set of 1.2 million sentences that were taken from biomedical abstracts.…”

Section: Motivation (mentioning)

confidence: 99%

Improving Text Mining with Controlled Natural Language: A Case Study for Protein Interactions

Kuhn

Royer

Fuchs

et al. 2006

Lecture Notes in Computer Science

Abstract. Linking the biomedical literature to other data resources is notoriously difficult and requires text mining. Text mining aims to automatically extract facts from literature. Since authors write in natural language, text mining is a great natural language processing challenge, which is far from being solved. We propose an alternative: If authors and editors summarize the main facts in a controlled natural language, text mining will become easier and more powerful. To demonstrate this approach, we use the language Attempto Controlled English (ACE). We define a simple model to capture the main aspects of protein interactions. To evaluate our approach, we collected a dataset of 459 paragraph headings about protein interaction from literature. 56% of these headings can be represented exactly in ACE and another 23% partially. These results indicate that our approach is feasible.
