Accurately identifying groundwater contamination sites is vital for groundwater protection and restoration. This study uses a machine learning (ML) approach to identify groundwater contamination sites, with total petroleum hydrocarbons (TPH) as the target contaminants, in a case study of gas stations in China. First, six classical ML algorithms (logistic regression, decision tree, gradient boosting decision tree (GBDT), random forest, multi-layer perceptron, and support vector machine) were applied to develop identification models of TPH-contaminated groundwater from 40 features, and their performances were compared. The GBDT model achieved the best prediction performance, with an F1 score of 1 and an AUC of 1. Next, Bayesian optimization was applied to the GBDT model (BO-GBDT), which reduced the training time from 19,125 s to 513 s while maintaining the same prediction performance (F1 score = 1, AUC = 1). Finally, Shapley additive explanations (SHAP) analysis was performed on the BO-GBDT model. The SHAP results showed that the critical feature variables in the BO-GBDT model include wind, population, evaporation, total potassium in the soil, precipitation, and leakage accidents. This study demonstrates that BO-GBDT is a satisfactory model for identifying groundwater TPH contamination at gas stations. The proposed method has the potential to be applied to other types of groundwater contamination sites.
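The models above are compared by F1 score and AUC. The abstract does not say how these metrics were computed, so the following is only a minimal sketch using their standard definitions for binary labels (1 = contaminated, 0 = not contaminated). The function names, the toy labels and scores, and the 0.5 decision threshold are illustrative assumptions, not data or settings from the study.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (assumption): three contaminated and three clean sites.
y_true = [1, 1, 1, 0, 0, 0]
perfect_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]    # classes fully separated
imperfect_scores = [0.9, 0.4, 0.7, 0.6, 0.2, 0.1]  # one positive ranked below a negative

THRESHOLD = 0.5  # assumed cut-off for turning scores into class labels

for name, scores in [("perfect", perfect_scores), ("imperfect", imperfect_scores)]:
    y_pred = [1 if s >= THRESHOLD else 0 for s in scores]
    print(name, round(f1_score(y_true, y_pred), 3), round(roc_auc(y_true, scores), 3))
# prints:
# perfect 1.0 1.0
# imperfect 0.667 0.889
```

Under these definitions, an F1 score of 1 together with an AUC of 1 means every sample in the evaluated set was classified correctly and every contaminated site received a higher score than every uncontaminated one.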