Abstract. Over the past decade, data science and machine learning have grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators, such as synthetic feature constructors, that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.
The filamentous fungi are an ecologically important group of organisms which also have important industrial applications but devastating effects as pathogens and agents of food spoilage. Protein kinases have been implicated in the regulation of virtually all biological processes but how they regulate filamentous fungal specific processes is not understood. The filamentous fungus Aspergillus nidulans has long been utilized as a powerful molecular genetic system and recent technical advances have made systematic approaches to study large gene sets possible. To enhance A. nidulans functional genomics we have created gene deletion constructs for 9851 genes representing 93.3% of the encoding genome. To illustrate the utility of these constructs, and advance the understanding of fungal kinases, we have systematically generated deletion strains for 128 A. nidulans kinases including expanded groups of 15 histidine kinases, 7 SRPK (serine-arginine protein kinases) kinases and an interesting group of 11 filamentous fungal specific kinases. We defined the terminal phenotype of 23 of the 25 essential kinases by heterokaryon rescue and identified phenotypes for 43 of the 103 non-essential kinases. Uncovered phenotypes ranged from almost no growth for a small number of essential kinases implicated in processes such as ribosomal biosynthesis, to conditional defects in response to cellular stresses. The data provide experimental evidence that previously uncharacterized kinases function in the septation initiation network, the cell wall integrity and the morphogenesis Orb6 kinase signaling pathways, as well as in pathways regulating vesicular trafficking, sexual development and secondary metabolism. Finally, we identify ChkC as a third effector kinase functioning in the cellular response to genotoxic stress. The identification of many previously unknown functions for kinases through the functional analysis of the A. nidulans kinome illustrates the utility of the A. nidulans gene deletion constructs.
We present an extension of the two-class multifactor dimensionality reduction (MDR) algorithm that enables detection and characterization of epistatic SNP-SNP interactions in the context of a quantitative trait. The proposed Quantitative MDR (QMDR) method handles continuous data by modifying MDR’s constructive induction algorithm to use a T-test. QMDR replaces the balanced accuracy metric with a T-test statistic as the score to determine the best interaction model. We used a simulation to identify the empirical distribution of QMDR’s testing score. We then applied QMDR to genetic data from the ongoing prospective Prevention of Renal and Vascular End-Stage Disease (PREVEND) study.
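The scoring idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cell-labelling rule (a genotype combination counts as "high" when its mean trait exceeds the overall mean), the choice of Welch's form of the t statistic, and the names label_cells, welch_t and qmdr_score are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def label_cells(snp_a, snp_b, trait):
    """Pool two SNPs into one high/low attribute (constructive induction)."""
    overall = mean(trait)
    cells = defaultdict(list)
    for a, b, y in zip(snp_a, snp_b, trait):
        cells[(a, b)].append(y)
    # Assumed rule: a genotype cell is "high" (True) when its mean trait
    # exceeds the overall mean, and "low" (False) otherwise.
    return {cell: mean(ys) > overall for cell, ys in cells.items()}


def welch_t(high, low):
    """Welch's two-sample t statistic; each group needs at least two values."""
    se = sqrt(variance(high) / len(high) + variance(low) / len(low))
    return (mean(high) - mean(low)) / se


def qmdr_score(snp_a, snp_b, trait):
    """Score one SNP pair by the t statistic between its high and low groups."""
    labels = label_cells(snp_a, snp_b, trait)
    high = [y for a, b, y in zip(snp_a, snp_b, trait) if labels[(a, b)]]
    low = [y for a, b, y in zip(snp_a, snp_b, trait) if not labels[(a, b)]]
    return welch_t(high, low)


# Toy data with an XOR-like interaction: the trait is high only when the
# two genotypes differ, so neither SNP shows a marginal effect on its own.
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
trait = [1.0, 5.0, 5.2, 1.1, 0.9, 4.8, 5.1, 1.2]
score = qmdr_score(snp_a, snp_b, trait)
```

In a full analysis, a score of this kind would be computed for every candidate SNP combination and the combination with the largest statistic kept as the best interaction model.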
Summary. A central goal of human genetics is to identify susceptibility genes for common human diseases. An important challenge is modelling gene-gene interaction or epistasis that can result in nonadditivity of genetic effects. The multifactor dimensionality reduction (MDR) method was developed as a machine learning alternative to parametric logistic regression for detecting interactions in the absence of significant marginal effects. The goal of MDR is to reduce the dimensionality inherent in modelling combinations of polymorphisms using a computational approach called constructive induction. Here, we propose a Robust Multifactor Dimensionality Reduction (RMDR) method that performs constructive induction using a Fisher's Exact Test rather than a predetermined threshold. The advantage of this approach is that only statistically significant genotype combinations are considered in the MDR analysis. We use simulation studies to demonstrate that this approach will increase the success rate of MDR when there are only a few genotype combinations that are significantly associated with case-control status. We show that there is no loss of success rate when this is not the case. We then apply the RMDR method to the detection of gene-gene interactions in genotype data from a population-based study of bladder cancer in New Hampshire.
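As a rough illustration of constructive induction driven by a significance test, the sketch below labels a single genotype combination with a two-sided Fisher's exact test. It is a sketch under stated assumptions, not the published method: the 2x2 table layout (cases and controls inside the cell versus outside it), the alpha parameter, the "unknown" label for non-significant cells, and the function names are all hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    observed = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))


def label_cell(cell_cases, cell_controls, total_cases, total_controls, alpha=0.05):
    """Label one genotype combination as 'high', 'low' or 'unknown' risk."""
    p = fisher_exact_two_sided(
        cell_cases,
        cell_controls,
        total_cases - cell_cases,
        total_controls - cell_controls,
    )
    if p >= alpha:
        # Assumed behaviour: non-significant cells are left out of the model.
        return "unknown"
    # Compare the cell's case:control ratio with the overall ratio
    # (cross-multiplied to avoid division by zero).
    if cell_cases * total_controls > cell_controls * total_cases:
        return "high"
    return "low"
```

For example, with 20 cases and 20 controls overall, a cell holding 10 cases and no controls would be labelled high risk, while a cell holding 2 cases and 2 controls would be left unlabelled because the test gives no evidence of association.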
With no known exceptions, every published microarray study to determine differential mRNA levels in eukaryotes used RNA extracted from whole cells. It is assumed that the use of whole cell RNA in microarray gene expression analysis provides a legitimate profile of steady-state mRNA. Standard labeling methods and the prevailing dogma that mRNA resides almost exclusively in the cytoplasm have led to the long-standing belief that the nuclear RNA contribution is negligible. We report that unadulterated cytoplasmic RNA uncovers differentially expressed mRNAs that otherwise would not have been detected when using whole cell RNA. We also report that the inclusion of nuclear RNA has a large impact on whole cell gene expression microarray results, distorting the mRNA profile to the extent that a substantial number of false positives are generated. We conclude that to produce a valid profile of the steady-state mRNA population, the nuclear component must be excluded, and to arrive at a more realistic view of a cell's gene expression profile, the nuclear and cytoplasmic RNA fractions should be analyzed separately.