Mainstream machine learning approaches to predictive analytics consistently perform well across a variety of datasets, but identifying the best-performing approach for any given dataset is far less intuitive. Methods such as ensemble and transformation modeling have been developed to improve on individual base learners and to handle datasets with large degrees of variance. Despite the increased generalizability and flexibility of ensemble approaches, they often sacrifice inference for predictive ability. This paper introduces an alternative approach to ensemble modeling that combines the predictive ability of an ensemble framework with localized model construction by incorporating cluster analysis as a pre-processing technique. The workflow not only outperforms independent base learners and comparative ensemble methods, but also preserves local inferential capability: cluster parameters can be manipulated, and interpretable relative importance values and non-transformed coefficients are retained for the overall assessment of variable importance. The paper demonstrates the ensemble technique on a dataset used to estimate rates of health insurance coverage across the state of Missouri, where the cluster pre-processing helps in understanding both local and global variable importance and interactions when predicting areas with high concentrations of low health insurance coverage from demographic, socioeconomic, and geospatial variables.
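The cluster-then-model idea in this abstract can be illustrated with a minimal sketch. It assumes K-means as the clustering step and one ordinary least-squares fit per cluster as the local model; the abstract does not specify the base learners, the ensemble weighting, or the cluster parameters, so the function names (`assign`, `kmeans`, `fit_local_models`, `predict`) and the toy data are hypothetical.

```python
import numpy as np


def assign(X, centroids):
    """Label each row with the index of its nearest centroid."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm, initialised from k distinct data rows."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        updated = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an emptied cluster keeps its old centroid
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids


def fit_local_models(X, y, centroids):
    """Fit one least-squares model per cluster on the untransformed features."""
    labels = assign(X, centroids)
    models = {}
    for j in range(len(centroids)):
        mask = labels == j
        if not mask.any():  # sketch assumption: empty clusters get no model
            continue
        A = np.column_stack([np.ones(mask.sum()), X[mask]])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        models[j] = coef  # [intercept, slope_1, ...] on the original scale
    return models


def predict(X, centroids, models):
    """Route each row to its cluster and apply that cluster's coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coefs = np.array([models[int(j)] for j in assign(X, centroids)])
    return (A * coefs).sum(axis=1)


# Toy data (hypothetical): two regions with opposite trends in one feature.
x = np.concatenate([np.linspace(0, 1, 20), np.linspace(10, 11, 20)])
X = x.reshape(-1, 1)
y = np.where(x < 5, 1.0 + 2.0 * x, 50.0 - 3.0 * x)

centroids = kmeans(X, k=2)
models = fit_local_models(X, y, centroids)
for j, coef in models.items():
    print(f"cluster {j}: centroid={centroids[j, 0]:.2f}, "
          f"intercept={coef[0]:.2f}, slope={coef[1]:.2f}")
```

Because each local model is fit on the untransformed features, its coefficients can be read directly for that cluster, which is the kind of local inference the abstract refers to. A single global linear fit on the same toy data would blur the two opposite slopes into one.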
In a previous study, Mueller et al. (ISPRS Int J Geo-Inf 8(1):13, 2019) presented a machine learning ensemble algorithm that uses K-means clustering as a preprocessing technique to increase predictive modeling performance. As a follow-on research effort, this study tests the stability and sensitivity of the previously introduced algorithm and presents an innovative method for extracting localized and state-level variable importance information from the original dataset, using a nontraditional method known as synthetic population generation. Through iterative generation of synthetic populations with underlying statistical properties similar to those of the original dataset, and through exploration of the distribution of health insurance coverage across the state of Missouri, we identified variables that contributed to clustering decisions, variables that contributed most significantly to modeling the distribution of health insurance status throughout the state, and variables that were most influential in optimizing model performance, having the greatest impact on change-in-mean-squared-error (MSE) measurements. Results suggest that cluster-based preprocessing approaches for machine learning algorithms can significantly increase performance, and they demonstrate how synthetic populations can be used in performance measurement to identify and test how far the statistical properties of variables within a dataset can vary without significant performance loss.
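A rough sketch of how synthetic populations could feed a change-in-MSE ranking is given below. The abstract does not say how the synthetic populations are generated or how the change in MSE is computed, so two assumptions are made here: populations are drawn from a multivariate Gaussian matching the original joint mean and covariance, and the change in MSE is measured by shuffling one variable at a time. All names and the toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "original" data: y depends strongly on x0, weakly on x1, not on x2.
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)


def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def mse(coef, X, y):
    """Mean squared error of the linear model on (X, y)."""
    A = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((A @ coef - y) ** 2))


def synthetic_population(X, y, size, rng):
    """Assumed generator: Gaussian with the original joint mean and covariance."""
    joint = np.column_stack([X, y])
    draws = rng.multivariate_normal(
        joint.mean(axis=0), np.cov(joint, rowvar=False), size=size
    )
    return draws[:, :-1], draws[:, -1]


def delta_mse(coef, X, y, rng):
    """Change in MSE when each column is shuffled in turn."""
    base = mse(coef, X, y)
    deltas = []
    for j in range(X.shape[1]):
        X_shuffled = X.copy()
        X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
        deltas.append(mse(coef, X_shuffled, y) - base)
    return np.array(deltas)


# Fit once on the original data, then score repeatedly on synthetic populations.
coef = fit_ols(X, y)
runs = []
for _ in range(20):
    X_syn, y_syn = synthetic_population(X, y, n, rng)
    runs.append(delta_mse(coef, X_syn, y_syn, rng))

mean_delta = np.mean(runs, axis=0)
ranking = np.argsort(mean_delta)[::-1]  # most influential variable first
print("mean change in MSE per variable:", np.round(mean_delta, 3))
print("ranking:", ranking)
```

Averaging over many synthetic draws gives a sense of how stable the ranking is under resampling. In the clustered setting described above, the same loop could be run within each cluster to compare local rankings against the state-level one.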