Background: The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. Results: In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. Conclusions: LINNAEUS is an open-source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.
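The dictionary-based matching idea described in this abstract can be illustrated with a minimal sketch. This is not the LINNAEUS implementation: the five-entry dictionary, the token-level trie (standing in for the deterministic finite-state automaton), and the names `build_trie` and `tag_species` are all assumptions made for illustration, and the ambiguity-resolution heuristics are omitted.

```python
import re

# Toy dictionary mapping species names to NCBI Taxonomy identifiers.
# Assumption: a real dictionary would be far larger; these entries are illustrative.
SPECIES = {
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "mouse": 10090,
    "drosophila melanogaster": 7227,
}

def build_trie(dictionary):
    """Build a token-level trie; a terminal node stores its taxonomy ID under '$id'."""
    root = {}
    for name, tax_id in dictionary.items():
        node = root
        for token in name.split():
            node = node.setdefault(token, {})
        node["$id"] = tax_id
    return root

def tag_species(text, trie):
    """Return (start, end, surface, tax_id) for the longest match at each position."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"[A-Za-z]+", text)]
    mentions = []
    i = 0
    while i < len(tokens):
        node, best = trie, None
        j = i
        # Walk the trie as far as the tokens allow, remembering the last terminal node.
        while j < len(tokens) and tokens[j][0] in node:
            node = node[tokens[j][0]]
            j += 1
            if "$id" in node:
                best = (j, node["$id"])
        if best is None:
            i += 1
            continue
        end_idx, tax_id = best
        start, end = tokens[i][1], tokens[end_idx - 1][2]
        mentions.append((start, end, text[start:end], tax_id))
        i = end_idx  # resume after the matched mention
    return mentions

text = "Genes conserved between Homo sapiens and mouse were studied."
mentions = tag_species(text, build_trie(SPECIES))
```

Each returned tuple carries character offsets, the surface string, and the normalized identifier, which is the shape of output a mention-level evaluation would need.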
Background: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). Results: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. Conclusions: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. 
Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
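The improvement percentages quoted in this abstract follow from the reported TAP-k scores as relative gains over the best single-team result. A short check, with variable and function names chosen here purely for illustration:

```python
# TAP-k scores on the gold standard, as reported in the abstract (keyed by k).
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}

def relative_improvement(new, old):
    """Percentage gain of `new` over `old`."""
    return 100.0 * (new - old) / old

improvements = {
    k: round(relative_improvement(composite[k], best_team[k]), 1)
    for k in best_team
}
```

The computed values match the 12.4%, 21.8%, and 26.6% stated for k=5, 10, and 20.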
Nephronophthisis (NPHP) is an autosomal recessive cystic kidney disease, caused by mutations of at least nine different genes. Several extrarenal manifestations characterize this disorder, including cerebellar defects, situs inversus and retinitis pigmentosa. While the clinical manifestations vary significantly in NPHP, mutations of NPHP5 and NPHP6 are always associated with progressive blindness. This clinical finding suggests that the gene products, nephrocystin-5 and nephrocystin-6, participate in overlapping signaling pathways to maintain photoreceptor homeostasis. To analyze the genetic interaction between these two proteins in more detail, we studied zebrafish embryos after depletion of NPHP5 and NPHP6. Knockdown of zebrafish zNPHP5 and zNPHP6 produced similar phenotypes, and synergistic effects were observed after the combined knockdown of zNPHP5 and zNPHP6. The N-terminal domain of nephrocystin-6 bound nephrocystin-5, and mapping studies delineated the interacting site from amino acid 696 to 896 of NPHP6. In Xenopus laevis, knockdown of NPHP5 caused substantial neural tube closure defects. This phenotype was copied by expression of the nephrocystin-5-binding fragment of nephrocystin-6, and rescued by co-expression of nephrocystin-5, supporting a physical interaction between both gene products in vivo. Since the N- and C-terminal fragments of nephrocystin-6 engage in the formation of homo- and heteromeric protein complexes, conformational changes seem to regulate the interaction of nephrocystin-6 with its binding partners.
Summary: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers has become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of Gnat achieves a Tap-20 score of 0.1987. Availability: The library and web services are implemented in Java and the sources are available from http://gnat.sourceforge.net. Contact: jorg.hakenberg@roche.com
Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological events are focused on either limited-scale studies and/or abstracts, with data extracted lacking context and rarely available to support further research. Results: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins that are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative. Availability: The BioContext pipeline is available for download (under the BSD license) at http://www.biocontext.org, along with the extracted data which is also available for online browsing. Contact: martin.gerner@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.
Background: The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms, dedicated teams manually curate publications about genes; however, for species with no such dedicated staff, many thousands of articles are never mapped to genes or genomic regions. Methodology/Principal Findings: To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features. 
Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data. Conclusion/Significance: By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome-informatics-inspired solution to accessing the ever-increasing biomedical literature.
Introduction: Sustainability has by now become an essential element in corporate contexts. Corporate sustainability may be addressed differently, either selectively or in an integrated manner. The holistic approach opted for here focuses primarily on conveying aspects of corporate sustainability across different socio-cultural contexts. In doing so, the full quadruple scope of sustainability dimensions is considered the cornerstone of a well-chosen case-study design. Such a particular reference prototypically exemplifies business models in highly volatile commodity markets. Continuously striving for optimization and rationalization due to creativity and productivity gains, the company is predominantly challenged by changing customer needs resulting from digitization as a megatrend of the twenty-first century. Case description: Alluding to the business of stationery and professional office supplies, this case study aims at advocating a systematic procedure for strengthening corporate sustainability by embracing both existing and evolving culture-bound approaches to sustainability vis-à-vis a framework implementation strategy and related guidelines for implementation at the operational level. The essential question is how to convey aspects of corporate sustainability across different socio-cultural contexts. Aiming to contribute to a framework implementation strategy, this implies assessing the status quo of corporate sustainability-related strategic approaches, activities and initiatives, means and instruments (sustainability performance); identifying universal and culture-bound drivers (sustainability opportunity); and deducing operational guidelines towards stakeholder awareness, selected strategic options, projects and best practices (sustainability commitment). 
Given the theoretical background of corporate sustainability and related contexts, impetus for operationalization is created by three means:
▪ Adopting an extended and generic life-cycle assessment (LCA) as the core analytic framework and structuring element, corresponding with the value-chain-oriented approach of business segments,
▪ Applying the multi-method paradigm of methodological triangulation, including stakeholder observation, expert interviews and company resources, and
▪ Framing major characteristics as a case study according to principles of qualitative content analysis.
Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study. Results: Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ∼20% of open access articles in PubMed Central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). We illustrate the utility of data extracted by text2genome from more than 150 000 PMC articles for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments. Conclusion: Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data. Availability and implementation: Source code is available under a BSD license from http://sourceforge.net/projects/text2genome/ and results can be browsed and downloaded at http://text2genome.org. Contact: maximilianh@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.
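The first step described in this abstract, pulling candidate DNA sequences out of article text, might be approximated as follows. This is a rough sketch and not the text2genome algorithm: the regular expression, the `MIN_LENGTH` cutoff of 18 nucleotides, and the function name `extract_dna_candidates` are assumptions made for illustration, and the subsequent mapping of sequences to genes and genomes is not shown.

```python
import re

# Hypothetical cutoff: short runs such as "tag" or "cat" are ordinary words, not sequences.
MIN_LENGTH = 18

# An uninterrupted run of A/C/G/T letters of at least MIN_LENGTH, bounded by non-word characters.
DNA_RUN = re.compile(r"\b[ACGT]{%d,}\b" % MIN_LENGTH, re.IGNORECASE)

def extract_dna_candidates(text):
    """Return upper-cased candidate nucleotide sequences found in `text`."""
    return [m.group().upper() for m in DNA_RUN.finditer(text)]

sample = "The forward primer 5'-ATGGCGTACGTTAGCCTAGGA-3' was used to tag a cat."
candidates = extract_dna_candidates(sample)
```

A length threshold is the simplest way to separate primers and probes from English words composed of the same four letters; a production system would also need to handle sequences broken by whitespace, digits, or line wraps.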