Inter-species normalization of gene mentions with GNAT

Hakenberg, Jörg; Plake, Conrad; Leaman, Robert; Schroeder, Michael; González, Graciela

doi:10.1093/bioinformatics/btn299

Cited by 96 publications

(86 citation statements)

References 12 publications


“…Thus, we backed off from some approaches presented previously, such as GNAT for entity mention normalization and an alignment-based pattern matching algorithm (see [20], [19]). The focus of all building blocks in our systems is on annotation time, and our goal was to provide a service that can handle a full-text article in 10 seconds and still have reasonable accuracy; our most accurate composition/tuning parameters should still be able to analyze a full text in about two minutes.…”

Section: Methods (mentioning)

confidence: 99%

“…NER systems in recent years have tended towards primarily employing machine learning techniques, including conditional random fields, due to the consistently high performance these techniques provide when trained on a high-quality corpora such as the BioCreative II gene mention corpus [22]. In contrast, however, systems for identifying proteins (also called normalization and grounding, EMN) have largely focused on dictionary-based techniques; some notable recent systems include GNAT [20] and GeNo [23]. For BioCreative II.5, we have identified several attributes which are useful for supporting the identification of proteins found, including: 1) the ability to associate a confidence with each mention found; 2) improving consistency via enforcing a one-sense-per-document assumption; 3) generating a list of candidate proteins (identifications) to which each mention could refer.…”

Section: Named Entity Recognition and Identification (mentioning)

confidence: 99%

“…We then narrow down each list of IDs per protein mention by species. We have shown previously [19], [20] how genes/proteins can be disambiguated by using context profiles, that is, information available on each protein that can be compared to the current text and measure the overlap in GO terms, disease associations, tissue specificity, chromosomal location of genes, protein length and mass, and so on. For efficiency required in the online tasks in BioCreative II.5, however, we do not perform actual disambiguation, but rank IDs by dictionary match and species only.…”

Section: Entity Mention Normalization (mentioning)

confidence: 99%
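
The ranking step described in the statement above (rank IDs by dictionary match and species, without full disambiguation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` type, the `match_score` field, the placeholder IDs, and the fallback to all candidates when no species matches are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    protein_id: str     # placeholder identifier, hypothetical
    species: str        # species the identifier belongs to
    match_score: float  # assumed dictionary-match quality in [0, 1]


def rank_candidates(candidates, species_in_text):
    """Filter candidate IDs by species found in the text, then sort by match score."""
    kept = [c for c in candidates if c.species in species_in_text]
    # Assumption: if no candidate matches a detected species, keep all of them.
    pool = kept or list(candidates)
    # Highest match score first; ties broken by ID so the order is deterministic.
    return sorted(pool, key=lambda c: (-c.match_score, c.protein_id))


candidates = [
    Candidate("ID_A", "human", 0.9),
    Candidate("ID_B", "mouse", 1.0),
    Candidate("ID_C", "human", 0.7),
]
ranked = [c.protein_id for c in rank_candidates(candidates, {"human"})]
```

With the sample data, only the two human candidates survive the species filter, and the one with the better dictionary match comes first.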

“…For both INT and IPT this can simply be achieved by submitting multiple identifiers per protein, thus leading to hundreds of UniProt IDs per article. The highest AUC iP/R submission of 43.5% contained an average of 83 IDs for each of the 252 relevant proteins (20,888 IDs in total for the 61 relevant documents, 342 IDs on average per document), with a resulting precision of 1.2%. For a database curation scenario, narrowing down the number of IDs per protein/article might prove suitable, for curators as well as authors writing an SDA; thus, we focus on reporting F-scores in the remainder.…”

Section: Ranking Of Extracted Pairs and Proteins (mentioning)

confidence: 99%
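
The figures quoted in the statement above can be checked with a few lines of arithmetic. The counts come from the quoted text; the precision estimate assumes that each of the 252 relevant proteins is hit by exactly one correct ID, which the quoted text does not state, so it is only a consistency check.

```python
total_ids = 20_888       # IDs submitted in total (from the quoted statement)
relevant_proteins = 252  # relevant proteins (from the quoted statement)
relevant_documents = 61  # relevant documents (from the quoted statement)

ids_per_protein = total_ids / relevant_proteins    # about 82.9, reported as 83
ids_per_document = total_ids / relevant_documents  # about 342.4, reported as 342

# Assumption: at most one correct ID per relevant protein, all of them found.
precision_estimate = relevant_proteins / total_ids  # about 0.012, i.e. 1.2%

print(round(ids_per_protein), round(ids_per_document), round(100 * precision_estimate, 1))
```

Under that assumption the estimate matches the reported precision of 1.2%, which shows how submitting many IDs per protein inflates AUC iP/R at the cost of precision.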

“…For related work, we refer the reader to the other articles published in this special issue, together with the overview paper by Hirschman et al [15]. Some of the approaches we discuss build on earlier systems described in [19] (gene mention normalization; pattern generation) and [20] (inter-species gene mention normalization). We discuss related work inline with the methods, data sets, and discussion where appropriate.…”

Section: Introduction (mentioning)

confidence: 99%


Efficient Extraction of Protein-Protein Interactions from Full-Text Articles

Hakenberg, Leaman, Vo, et al. 2010

IEEE/ACM Trans. Comput. Biol. and Bioinf.


Abstract: Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research articles, which in turn get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant articles, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment, to evaluate and compare different methodologies, to make aware of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein named entity recognition including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between ten seconds and two minutes. We propose strategies to transfer document-level annotations to the sentence-level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22% for finding protein interactions, and 43% for mapping proteins to UniProt IDs; disregarding species, f-scores are 30% and 55%, respectively. On average, our best-performing setup required around two minutes per full text. All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information.

