Background: Biomarker identification is one of the major goals of functional genomics and translational medicine studies. Large-scale -omics data are increasingly being accumulated and can provide a vital means of identifying biomarkers for the early diagnosis of complex diseases and/or for advanced patient/disease stratification. These tasks are clearly interlinked, and it is essential that an unbiased and stable methodology is applied to address them. Although many biomarker identification approaches, primarily based on machine learning, have recently been developed, exploring the potential associations between biomarker identification and the design of future experiments remains a challenge.

Methods: In this study, using both simulated and published experimentally derived datasets, we assessed the performance of several state-of-the-art Random Forest (RF) based decision approaches: the Boruta method, permutation-based feature selection without correction, permutation-based feature selection with correction, and backward-elimination-based feature selection. Moreover, we conducted a power analysis, using the stable features derived in the previous step, to estimate the number of samples required for potential future studies.

Results: We present a number of different RF-based stable feature selection methods and compare their performance using simulated as well as published, experimentally derived datasets. Across all of the scenarios considered, we found the Boruta method to be the most stable, whilst the Permutation (Raw) approach offered the largest number of relevant features when allowed to stabilise over a number of iterations. Finally, we developed and made available a web interface (https://joelarkman.shinyapps.io/PowerTools/) to streamline power calculations, thereby aiding the design of potential future studies within a translational medicine context.

Conclusions: We developed an RF-based biomarker discovery framework and provide a web interface for it, termed PowerTools, that supports the design of appropriate and cost-effective subsequent omics studies.
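To make the power-analysis step concrete, the following is a minimal sketch of a sample-size calculation. The abstract does not state which test or effect-size measure the authors or PowerTools use, so everything here is an assumption: a two-sided, two-group comparison of one biomarker, a standardised effect size (Cohen's d), the normal approximation to the two-sample test, and a Bonferroni adjustment when several selected features are tested. The function name `samples_per_group` and its parameters are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(effect_size: float, alpha: float = 0.05,
                      power: float = 0.8, n_features: int = 1) -> int:
    """Approximate per-group sample size for a two-sided, two-group comparison.

    Assumption (not from the paper): normal approximation
        n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2
    with alpha divided by n_features (Bonferroni) when several
    candidate biomarkers are tested at once.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    if not 0 < alpha < 1 or not 0 < power < 1:
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    if n_features < 1:
        raise ValueError("n_features must be at least 1")

    adjusted_alpha = alpha / n_features          # Bonferroni adjustment
    standard_normal = NormalDist()               # mean 0, standard deviation 1
    z_alpha = standard_normal.inv_cdf(1 - adjusted_alpha / 2)
    z_power = standard_normal.inv_cdf(power)

    # Round up so the requested power is reached, not merely approached.
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # A medium effect (d = 0.5) at alpha = 0.05 and 80% power.
    print(samples_per_group(0.5))  # 63
    # The same effect when 10 selected features are tested together.
    print(samples_per_group(0.5, n_features=10))
```

Under these assumptions, the required sample size grows as the effect size shrinks or as more features are tested together, which is why a stable, compact feature set matters when planning a follow-up study. An exact calculation based on the t distribution would give slightly larger numbers for small groups.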