Motivation Automated machine learning (AutoML) systems are helpful data science assistants designed to scan data for novel features, select appropriate supervised learning models and optimize their parameters. For this purpose, the Tree-based Pipeline Optimization Tool (TPOT) was developed using strongly typed genetic programming (GP) to recommend an optimized analysis pipeline for the data scientist’s prediction problem. However, like other AutoML systems, TPOT may reach computational resource limits when working on big data such as whole-genome expression data.
Results We introduce two new features implemented in TPOT that help increase the system’s scalability: Feature Set Selector (FSS) and Template. FSS provides the option to specify subsets of the features as separate datasets, assuming the signals come from one or more of these specific data subsets. FSS increases TPOT’s efficiency on big data by slicing the entire dataset into smaller sets of features and allowing GP to select the best subset in the final pipeline. Template enforces type constraints with strongly typed GP and enables the incorporation of FSS at the beginning of each pipeline. Consequently, FSS and Template help reduce TPOT computation time and may provide more interpretable results. Our simulations show that TPOT-FSS significantly outperforms a tuned XGBoost model and the standard TPOT implementation. We apply TPOT-FSS to real RNA-Seq data from a study of major depressive disorder. Independent of the previous study that identified a significant association of two modules with depression severity, TPOT-FSS corroborates that one of the modules is largely predictive of the clinical diagnosis of each individual.
Availability and implementation Detailed simulation and analysis code needed to reproduce the results in this study is available at https://github.com/lelaboratoire/tpot-fss. Implementation of the new TPOT operators is available at https://github.com/EpistasisLab/tpot.
Supplementary information Supplementary data are available at Bioinformatics online.
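The feature-set idea described in the Results above can be illustrated with a minimal sketch. It is not TPOT's implementation or API: the helper names (`select_columns`, `centroid_accuracy`, `best_feature_set`), the toy data and the nearest-centroid scorer are all assumptions made for illustration, and an exhaustive comparison of subsets stands in for the GP search that TPOT actually performs.

```python
from statistics import mean


def select_columns(rows, columns):
    """Keep only the given column indices (one named feature set)."""
    return [[row[c] for c in columns] for row in rows]


def centroid_accuracy(rows, labels):
    """Training accuracy of a nearest-centroid classifier (hypothetical scorer)."""
    classes = sorted(set(labels))
    centroids = {
        c: [mean(col) for col in zip(*[r for r, y in zip(rows, labels) if y == c])]
        for c in classes
    }

    def predict(row):
        # Assign the class whose centroid is closest in squared Euclidean distance.
        return min(
            classes,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )

    return mean(1.0 if predict(r) == y else 0.0 for r, y in zip(rows, labels))


def best_feature_set(rows, labels, feature_sets):
    """Score every named feature subset and return the best name plus all scores."""
    scores = {
        name: centroid_accuracy(select_columns(rows, cols), labels)
        for name, cols in feature_sets.items()
    }
    return max(scores, key=scores.get), scores


# Toy data (assumed): columns 0-1 carry the class signal, columns 2-3 do not.
rows = [
    [0.1, 0.2, 5.0, 1.0],
    [0.2, 0.1, 1.0, 5.0],
    [0.0, 0.3, 3.0, 3.0],
    [0.9, 1.0, 5.0, 1.0],
    [1.0, 0.8, 1.0, 5.0],
    [0.8, 0.9, 3.0, 3.0],
]
labels = [0, 0, 0, 1, 1, 1]
feature_sets = {"module_a": [0, 1], "module_b": [2, 3]}

best, scores = best_feature_set(rows, labels, feature_sets)
print(best, scores)
```

Each candidate model in this sketch sees only a slice of the columns, which is where the computational savings come from. The name of the winning subset is also directly interpretable, for example as a gene module.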
These findings support the clinical utility of a massively parallel sequencing panel for craniosynostosis. TCF12 and EFNB1 should be included in genetic testing for nonsyndromic coronal craniosynostosis or clinically suspected Saethre-Chotzen syndrome.
Craniosynostosis is one of the most common craniofacial disorders encountered in clinical genetics practice, with an overall incidence of 1 in 2,500. Between 30% and 70% of syndromic craniosynostoses are caused by mutations in hotspots in the fibroblast growth factor receptor (FGFR) genes or in the TWIST1 gene, with the difference in detection rates likely related to different study populations within craniofacial centers. Here we present results from molecular testing of an Australia and New Zealand cohort of 630 individuals with a diagnosis of craniosynostosis. Data were obtained by Sanger sequencing of FGFR1, FGFR2, and FGFR3 hotspot exons and the TWIST1 gene, as well as copy number detection of TWIST1. Of the 630 probands, there were 231 who had one of 80 distinct mutations (36%). Among the 80 mutations, 17 novel sequence variants were detected in three of the four genes screened. In addition to the proband cohort, there were 96 individuals who underwent predictive or prenatal testing as part of family studies. Dysmorphic features consistent with the known FGFR1-3/TWIST1-associated syndromes were predictive for mutation detection. We also show a statistically significant association between splice site mutations in FGFR2 and a clinical diagnosis of Pfeiffer syndrome, more severe clinical phenotypes associated with FGFR2 exon 10 versus exon 8 mutations, and more frequent surgical procedures in the presence of a pathogenic mutation. Targeting gene hotspot areas for mutation analysis is a useful strategy to maximize the success of molecular diagnosis for individuals with craniosynostosis.
Summary Machine learning feature selection methods are needed to detect complex interaction-network effects in complicated modeling scenarios in high-dimensional data, such as GWAS, gene expression, eQTL and structural/functional neuroimage studies for case–control or continuous outcomes. In addition, many machine learning methods have limited ability to address the issues of controlling false discoveries and adjusting for covariates. To address these challenges, we develop a new feature selection technique called Nearest-neighbor Projected-Distance Regression (NPDR) that calculates the importance of each predictor using generalized linear model regression of distances between nearest-neighbor pairs projected onto the predictor dimension. NPDR captures the underlying interaction structure of data using nearest-neighbors in high dimensions, handles both dichotomous and continuous outcomes and predictor data types, statistically corrects for covariates, and permits statistical inference and penalized regression. We use realistic simulations with interactions and other effects to show that NPDR has better precision-recall than standard Relief-based feature selection and random forest importance, with the additional benefit of covariate adjustment and multiple testing correction. Using RNA-Seq data from a study of major depressive disorder (MDD), we show that NPDR with covariate adjustment removes spurious associations due to confounding. We apply NPDR to eQTL data to identify potentially interacting variants that regulate transcripts associated with MDD and demonstrate NPDR’s utility for GWAS and continuous outcomes.
Availability and implementation Available at: https://insilico.github.io/npdr/.
Supplementary information Supplementary data are available at Bioinformatics online.
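The projected-distance regression described in the Summary above can be sketched roughly as follows, under strong simplifying assumptions. This is not the npdr package or its API. It covers only a continuous outcome, uses a plain least-squares slope in place of a full generalized linear model, and omits covariates, p-values and penalization. All function names and the toy data are hypothetical.

```python
def manhattan(a, b):
    """Manhattan distance between two samples in the full predictor space."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_neighbor_pairs(X, k):
    """Return (i, j) pairs where j is one of the k nearest neighbors of i."""
    pairs = []
    for i, xi in enumerate(X):
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: manhattan(xi, X[j]),
        )
        pairs.extend((i, j) for j in others[:k])
    return pairs


def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return 0.0  # predictor difference is constant across pairs
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def projected_distance_scores(X, y, k):
    """One score per predictor: slope of |outcome diff| on |projected diff|."""
    pairs = nearest_neighbor_pairs(X, k)
    outcome_diff = [abs(y[i] - y[j]) for i, j in pairs]
    scores = []
    for a in range(len(X[0])):
        # Project each neighbor pair's distance onto predictor a alone.
        projected = [abs(X[i][a] - X[j][a]) for i, j in pairs]
        scores.append(ols_slope(projected, outcome_diff))
    return scores


# Toy data (assumed): the outcome equals predictor 0; predictor 1 just alternates.
X = [[0, 1], [1, 0], [2, 1], [3, 0], [4, 1], [5, 0]]
y = [row[0] for row in X]

scores = projected_distance_scores(X, y, k=2)
print(scores)
```

In this sketch a larger positive slope means that neighbor pairs which differ more on a predictor also differ more in outcome. Neighbors are found in the full predictor space, while each regression uses a single projected dimension.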