Bayesian additive regression trees (BART) is a flexible prediction model and machine learning approach that has gained widespread popularity in recent years. As BART becomes more mainstream, there is an increased need for a paper that walks readers through the details of BART, from what it is to why it works. This tutorial aims to provide such a resource. In addition to explaining the different components of BART using simple examples, we also discuss a framework, the General BART model, that unifies some of the recent BART extensions, including semiparametric models, models for correlated outcomes, statistical matching problems in surveys, and models with weaker distributional assumptions. By showing how these models fit into a single framework, we hope to demonstrate a simple way of applying BART to research problems that go beyond the original independent continuous or binary outcomes framework.
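For orientation, the original BART model for an independent continuous outcome (Chipman et al., 2010) is usually written as a sum of regression trees plus Gaussian noise. The symbols below ($m$, $T_j$, $M_j$, $g$) follow common usage and are an assumption here, not notation taken from the tutorial itself.

```latex
y_i = \sum_{j=1}^{m} g(x_i; T_j, M_j) + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
```

Here $T_j$ denotes the structure of the $j$-th tree, $M_j$ its set of terminal-node parameters, and $g(x_i; T_j, M_j)$ the terminal-node value that tree $j$ assigns to covariates $x_i$. Priors on $(T_j, M_j)$ and $\sigma^2$ keep each tree a weak learner.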
The development of driverless vehicles has spurred the need to predict human driving behavior to facilitate interaction between driverless and human-driven vehicles. Predicting human driving movements can be challenging, and poor prediction models can lead to accidents between driverless and human-driven vehicles. We used the vehicle speed obtained from a naturalistic driving dataset to predict whether a human-driven vehicle would stop before executing a left turn. In a preliminary analysis, we found that BART produced higher and less variable AUC values than a variety of other state-of-the-art binary prediction methods. However, BART assumes independent observations, whereas our dataset consists of multiple observations clustered by driver. Although methods extending BART to clustered or longitudinal data are available, they lack readily available software and can only be applied to clustered continuous outcomes. We extended BART to handle correlated binary observations by adding a random intercept, and used a simulation study to assess bias, root mean squared error, 95% coverage, and average length of the 95% credible interval in a correlated data setting. We then applied our random intercept BART model to our clustered dataset and found substantial improvements in prediction performance compared to BART and random intercept linear logistic regression. (arXiv:1609.07464v2 [stat.AP], 1 May 2017)
Examples of "doubly robust" estimators for missing data include augmented inverse probability weighting (AIPWT) models (Robins et al., 1994) and penalized splines of propensity prediction (PSPP) models (Zhang and Little, 2009). Doubly robust estimators have the property that, if either the response propensity or the mean is modeled correctly, a consistent estimator of the population mean is obtained. However, doubly robust estimators can perform poorly when modest misspecification is present in both models (Kang and Schafer, 2007). Here we consider extensions of the AIPWT and PSPP models that use Bayesian additive regression trees (BART; Chipman et al., 2010) to provide highly robust propensity and mean model estimation. We term these "robust-squared" in the sense that the propensity score, the means, or both can be estimated with minimal model misspecification and then applied to the doubly robust estimator. We examine their behavior via simulations in which the propensity models, the mean models, or both are misspecified. We apply our proposed method to impute missing instantaneous velocity (delta-v) values from the 2014 National Automotive Sampling System Crashworthiness Data System dataset and missing Blood Alcohol Concentration values from the 2015 Fatality Analysis Reporting System dataset. We found that BART applied to PSPP and AIPWT provides more robust and efficient estimates than PSPP and AIPWT alone, with the BART-estimated propensity score combined with PSPP providing the most efficient estimator with close to nominal coverage.
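The double-robustness property described above can be illustrated with a minimal sketch of an AIPWT estimator of a population mean under missing outcomes. This is not the authors' implementation: the function name `aipw_mean`, the simulated data, and the plugged-in propensity and mean models are all hypothetical, and in the paper those two models would be estimated (for example with BART) rather than supplied by hand.

```python
import numpy as np

def aipw_mean(y, observed, propensity, mean_pred):
    """AIPW estimate of the population mean of y when some y are missing.

    y          : outcomes (entries with observed == 0 are ignored)
    observed   : 1 if y is observed, 0 if missing
    propensity : estimated P(observed = 1 | x), strictly positive
    mean_pred  : estimated E[y | x] for every unit
    """
    y_safe = np.where(observed == 1, y, 0.0)  # mask missing outcomes
    # Model prediction plus an inverse-probability-weighted residual correction.
    correction = observed * (y_safe - mean_pred) / propensity
    return float(np.mean(mean_pred + correction))

# Hypothetical simulated data: the true population mean of y is 2.
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)
true_prop = 1.0 / (1.0 + np.exp(-(0.5 + x)))  # response depends on x
observed = (rng.random(n) < true_prop).astype(int)

# Naive complete-case mean: biased because responders have larger x.
cc_mean = float(y[observed == 1].mean())

# Case A: correct mean model, deliberately wrong (constant) propensity.
est_a = aipw_mean(y, observed, np.full(n, 0.5), 2.0 + 3.0 * x)

# Case B: deliberately wrong (constant) mean model, correct propensity.
est_b = aipw_mean(y, observed, true_prop, np.full(n, cc_mean))

print(f"complete-case: {cc_mean:.2f}  case A: {est_a:.2f}  case B: {est_b:.2f}")
```

In both cases only one of the two working models is correct, yet the estimate should land near the true mean, while the complete-case average does not. If both models were misspecified at once, this protection would be lost, which is the situation the flexible BART-based models are meant to guard against.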
The Health and Retirement Study (HRS) is a longitudinal study of U.S. adults enrolled at age 50 and older. We were interested in investigating the effect of a sudden large decline in wealth on the cognitive ability of subjects, measured using a composite score provided with the dataset. However, our analysis was complicated by the lack of randomization, time-dependent confounding, and the fact that a substantial fraction of the sample and population will die during follow-up, leading to some of our outcomes being censored. The common method for handling this type of problem is marginal structural models (MSM). Although MSM produces valid estimates, it may not be the most appropriate method for reflecting a useful real-world situation, because MSM upweights subjects who are more likely to die in order to obtain a hypothetical population that, over time, resembles the one that would have been obtained in the absence of death. A more refined and practical framework, principal stratification (PS), restricts the analysis to the stratum of the population that would survive regardless of negative wealth shock experience. In this work, we propose a new algorithm for estimating the treatment effect under PS by imputing the counterfactual survival status and outcomes. Simulation studies suggest that our algorithm works well in various scenarios. We found no evidence that a negative wealth shock experience would affect the cognitive score of HRS subjects.