In this paper, we introduce a new type of tree-based method, reinforcement learning trees (RLT), which exhibits significantly improved performance over traditional methods such as random forests (Breiman, 2001) under high-dimensional settings. The innovations are three-fold. First, the new method implements reinforcement learning at each selection of a splitting variable during the tree construction process. By splitting on the variable that brings the greatest future improvement in later splits, rather than choosing the one with the largest marginal effect from the immediate split, the constructed tree utilizes the available samples in a more efficient way. Moreover, such an approach enables linear combination cuts at little extra computational cost. Second, we propose a variable muting procedure that progressively eliminates noise variables during the construction of each individual tree. The muting procedure also takes advantage of reinforcement learning and prevents noise variables from being considered in the search for splitting rules, so that towards terminal nodes, where the sample size is small, the splitting rules are still constructed from only strong variables. Last, we investigate asymptotic properties of the proposed method under basic assumptions and discuss the rationale in general settings.
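The split-selection and muting steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that an embedded model fitted at the current node has already produced an importance score for each variable, and the function name `choose_split_and_mute`, the `protected` set, and the `n_mute` parameter are hypothetical names introduced only for this illustration.

```python
def choose_split_and_mute(importance, active, protected, n_mute):
    """Pick a splitting variable and mute the weakest variables at one node.

    importance: dict mapping variable name -> importance score, assumed to
                come from an embedded model fitted on this node's samples.
    active:     list of variables still eligible for splitting at this node.
    protected:  set of variables that are never muted (an assumption here).
    n_mute:     number of variables to drop before passing to child nodes.
    Returns (split_var, new_active).
    """
    # Rank the active variables from most to least important.
    ranked = sorted(active, key=lambda v: importance[v], reverse=True)

    # Split on the variable with the highest estimated importance,
    # rather than on the best immediate marginal split.
    split_var = ranked[0]

    # Walk from least to most important, skipping protected variables,
    # and mute the first n_mute of them.
    mutable = [v for v in reversed(ranked) if v not in protected]
    muted = set(mutable[:n_mute])

    # Child nodes only search over the variables that survive muting.
    new_active = [v for v in active if v not in muted]
    return split_var, new_active


# Toy importance scores (made up for illustration, not from the paper).
scores = {"x1": 0.9, "x2": 0.5, "x3": 0.01, "x4": 0.0, "x5": 0.02}
split_var, remaining = choose_split_and_mute(
    scores, ["x1", "x2", "x3", "x4", "x5"], protected={"x1"}, n_mute=2
)
```

Applied recursively, the active set shrinks as the tree grows deeper, so nodes with few samples search over only the variables that still look informative.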
We propose recursively imputed survival tree (RIST) regression for right-censored data. This new nonparametric regression procedure uses a novel recursive imputation approach combined with extremely randomized trees that allows significantly better use of censored data than previous tree-based methods, yielding improved model fit and reduced prediction error. The proposed method can also be viewed as a type of Monte Carlo EM algorithm which generates extra diversity in the tree-based fitting process. Simulation studies and data analyses demonstrate the superior performance of RIST compared to previous methods.
In multidimensional cancer omics studies, one subject is profiled on multiple layers of omics activities. In this article, the goal is to integrate multiple types of omics measurements, identify markers, and build a model for cancer outcome. The proposed analysis is achieved in two steps. In the first step, we analyze the regulation among different types of omics measurements, through the construction of linear regulatory modules (LRMs). The LRMs have a sound biological basis, and their construction differs from the existing analyses by modeling the regulation of sets of gene expressions (GEs) by sets of regulators. The construction is realized with the assistance of regularized singular value decomposition. In the second step, the proposed cancer outcome model includes the regulated GEs, "residuals" of GEs, and "residuals" of regulators, and we use regularized estimation to select relevant markers. Simulation shows that the proposed method outperforms the alternatives with more accurate marker identification. We analyze The Cancer Genome Atlas data on cutaneous melanoma and lung adenocarcinoma and obtain meaningful results.
Sepsis is a leading cause of death and is the most expensive condition to treat in U.S. hospitals. Despite targeted efforts to automate earlier detection of sepsis, current techniques rely exclusively on using either standard clinical data or novel biomarker measurements. In this study, we apply machine learning techniques to assess the predictive power of combining multiple biomarker measurements from a single blood sample with electronic medical record (EMR) data for the identification of patients in the early to peak phase of sepsis in a large community hospital setting. Combining biomarkers and EMR data achieved an area under the receiver operating characteristic (ROC) curve (AUC) of 0.81, while EMR data alone achieved an AUC of 0.75. Furthermore, a single measurement of six biomarkers (IL-6, nCD64, IL-1ra, PCT, MCP1, and G-CSF) yielded the same predictive power as collecting an additional 16 hours of EMR data (AUC of 0.80), suggesting that the biomarkers may be useful for identifying these patients earlier. Ultimately, supervised learning using a subset of biomarker and EMR data as features may be capable of identifying patients in the early to peak phase of sepsis in a diverse population and may provide a tool for more timely identification and intervention.
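The AUC values quoted above can be read as the probability that a randomly chosen septic patient receives a higher risk score than a randomly chosen non-septic patient. The sketch below computes that quantity directly from pairwise comparisons. The labels and scores are toy values invented for illustration and are not data from the study; the function name `auc` is likewise only an illustrative choice.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: sequence of 0/1 outcomes (1 = positive case).
    scores: sequence of model scores, higher meaning more likely positive.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores, not from the study).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.5, 0.3, 0.2]
toy_auc = auc(toy_labels, toy_scores)  # 8 of 9 positive-negative pairs are ordered correctly
```

Under this reading, an AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the reported 0.75 and 0.81 differ.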