Mining phenotypes for gene function prediction

Groth, Philip; Weiß, Bertram; Pohlenz, Hans-Dieter; Leser, Ulf

doi:10.1186/1471-2105-9-136

Cited by 41 publications

(28 citation statements)

References 45 publications


“…The researcher collects many variables on each study subject and then wants to identify the variables that have an influence on the outcome variable. This problem becomes especially pronounced with modern high-throughput experiments where the number of variables p is often much larger than the number of observations n (e.g., genomics, transcriptomics, proteomics, metabolomics, metabonomics and phenomics; see [1-6]) or in complex modeling situations with many potential predictors, where the aim is to find a meaningful non-linear model (see e.g., [7]). One of the major aims in the analysis of these high-dimensional data sets is to detect the signal variables S, while controlling the number of selected noise variables N.…”

Section: Introduction (mentioning)

confidence: 99%

Controlling false discoveries in high-dimensional situations: boosting with stability selection

2015


Background: Modern biotechnologies often result in high-dimensional data sets with many more variables than observations (n≪p). These data sets pose new challenges to statistical analysis: variable selection becomes one of the most important tasks in this setting. Similar challenges arise in modern data sets from observational studies, e.g., in ecology, where flexible, non-linear models are fitted to high-dimensional data. We assess the recently proposed flexible framework for variable selection called stability selection. By the use of resampling procedures, stability selection adds a finite sample error control to high-dimensional variable selection procedures such as Lasso or boosting. We consider the combination of boosting and stability selection and present results from a detailed simulation study that provide insights into the usefulness of this combination. The interpretation of the used error bounds is elaborated and insights for practical data analysis are given.

Results: Stability selection with boosting was able to detect influential predictors in high-dimensional settings while controlling the given error bound in various simulation scenarios. The dependence on various parameters such as the sample size, the number of truly influential variables or tuning parameters of the algorithm was investigated. The results were applied to investigate phenotype measurements in patients with autism spectrum disorders using a log-linear interaction model which was fitted by boosting. Stability selection identified five differentially expressed amino acid pathways.

Conclusion: Stability selection is implemented in the freely available R package stabs (http://CRAN.R-project.org/package=stabs). It proved to work well in high-dimensional settings with more predictors than observations for both linear and additive models. The original version of stability selection, which controls the per-family error rate, is quite conservative, though this is much less the case for its improvement, complementary pairs stability selection. Nevertheless, care should be taken to appropriately specify the error bound.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0575-3) contains supplementary material, which is available to authorized users.


“…In our own early work on using text for characterizing gene's function, we have introduced the use of probabilistic topic models applied to PubMed abstracts for representing sets of genes sharing a common function [53]. Van Driel et al [16] later use a similar idea for grouping and characterizing genes, by identifying similarities among the text describing their respective phenotypes, obtained from OMIM; Groth et al [21,22] also approach phenotype-based study of genes by applying a clustering technique to the text descriptions of phenotypes, and associating text and keywords within it with GO categories. A text-based classification system by Stapley et al [57] used support vector machines to assign yeast proteins to subcellular locations; Nenadic et al [36] used a similar approach to annotate proteins with one of 11 biological process terms from the upper levels of the GO hierarchy.…”

Section: Introduction (mentioning)

confidence: 99%

Text as data: Using text-based features for proteins representation and for computational prediction of their characteristics

Shatkay, Brady, Wong

2015

Methods


The current era of large-scale biology is characterized by a fast-paced growth in the number of sequenced genomes and, consequently, by a multitude of identified proteins whose function has yet to be determined. Simultaneously, any known or postulated information concerning genes and proteins is part of the ever-growing published scientific literature, which is expanding at a rate of over a million new publications per year. Computational tools that attempt to automatically predict and annotate protein characteristics, such as function and localization patterns, are being developed along with systems that aim to support the process via text mining. Most work on protein characterization focuses on features derived directly from protein sequence data. Protein-related work that does aim to utilize the literature typically concentrates on extracting specific facts (e.g., protein interactions) from text. In the past few years we have taken a different route, treating the literature as a source of text-based features, which can be employed just as sequence-based protein-features were used in earlier work, for predicting protein subcellular location and possibly also function. We discuss here in detail the overall approach, along with results from work we have done in this area demonstrating the value of this method and its potential use.


“…This would seem equally important for ecotoxicology and should be encouraged. In the interim, approaches developed for phenotype clustering (phenoclustering) based on automated literature searching using semantic (text) clustering tools [116] may have some value for assisting in AOP development.…”

Section: Mining the Extant Literature For Relevant Information (mentioning)

confidence: 99%

Defining and modeling known adverse outcome pathways: Domoic acid and neuronal signaling as a case study

Watanabe, Andersen, Basu, et al.

2010

Environmental Toxicology and Chemistry


An adverse outcome pathway (AOP) is a sequence of key events from a molecular-level initiating event and an ensuing cascade of steps to an adverse outcome with population-level significance. To implement a predictive strategy for ecotoxicology, the multiscale nature of an AOP requires computational models to link salient processes (e.g., in chemical uptake, toxicokinetics, toxicodynamics, and population dynamics). A case study with domoic acid was used to demonstrate strategies and enable generic recommendations for developing computational models in an effort to move toward a toxicity testing paradigm focused on toxicity pathway perturbations applicable to ecological risk assessment. Domoic acid, an algal toxin with adverse effects on both wildlife and humans, is a potent agonist for kainate receptors (ionotropic glutamate receptors whose activation leads to the influx of Na⁺ and Ca²⁺). Increased Ca²⁺ concentrations result in neuronal excitotoxicity and cell death, primarily in the hippocampus, which produces seizures, impairs learning and memory, and alters behavior in some species. Altered neuronal Ca²⁺ is a key process in domoic acid toxicity, which can be evaluated in vitro. Furthermore, results of these assays would be amenable to mechanistic modeling for identifying domoic acid concentrations and Ca²⁺ perturbations that are normal, adaptive, or clearly toxic. In vitro assays with outputs amenable to measurement in exposed populations can link in vitro to in vivo conditions, and toxicokinetic information will aid in linking in vitro results to the individual organism. Development of an AOP required an iterative process with three important outcomes: a critically reviewed, stressor-specific AOP; identification of key processes suitable for evaluation with in vitro assays; and strategies for model development.
