Many studies have attempted to predict chlorophyll-a concentrations using multiple regression models validated with a hold-out technique. In this study, commonly used machine learning models, namely Support Vector Regression, Bagging, Random Forest, Extreme Gradient Boosting (XGBoost), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM), are used to build a new model to predict chlorophyll-a concentrations in the Nakdong River, Korea. We employed 1-step-ahead recursive prediction to reflect the characteristics of the time series data. To increase prediction accuracy, model construction was based on forward variable selection. The fitted models were validated by means of cumulative learning and rolling window learning rather than the hold-out technique. The best results were obtained when the chlorophyll-a concentration was predicted by combining the RNN model with the rolling window learning method. The results suggest that the selection of explanatory variables and 1-step-ahead recursive prediction are important steps for improving the prediction performance of machine learning models.
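The difference between cumulative learning and rolling window learning can be illustrated with a minimal sketch. It assumes that cumulative learning means a training set that grows by one observation per step, and that rolling window learning means a fixed-length training set that slides forward; the abstract does not give these details. The function names, the window length, and the ordinary least-squares model (a stand-in for the RNN and the other models) are hypothetical choices made only for illustration.

```python
import numpy as np

def one_step_ahead_splits(n_samples, initial_train, window=None):
    """Yield (train_indices, test_index) pairs for 1-step-ahead validation.

    window=None -> cumulative learning: the training set grows each step.
    window=k    -> rolling window learning: only the latest k points are used.
    """
    for t in range(initial_train, n_samples):
        start = 0 if window is None else max(0, t - window)
        yield np.arange(start, t), t

def evaluate(X, y, initial_train, window=None):
    """Refit a least-squares model at every step, predict the next point,
    and return the root mean squared error over all predicted points.

    The linear model is only a placeholder (an assumption of this sketch),
    not one of the models compared in the study.
    """
    preds, actual = [], []
    for train_idx, test_idx in one_step_ahead_splits(len(y), initial_train, window):
        # Design matrix with an intercept column.
        A = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
        x_new = np.concatenate([[1.0], X[test_idx]])
        preds.append(x_new @ coef)
        actual.append(y[test_idx])
    preds, actual = np.array(preds), np.array(actual)
    return float(np.sqrt(np.mean((preds - actual) ** 2)))
```

Calling `evaluate(X, y, initial_train=10)` gives the cumulative variant, and `evaluate(X, y, initial_train=10, window=10)` gives the rolling variant. In both cases every prediction uses only observations that precede the predicted time point, which is what distinguishes these schemes from a single hold-out split.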
This study adopts two approaches to analyze the occurrence of algae at Haman Weir on the Nakdong River: one is a traditional statistical method, logistic regression, and the other is a set of machine learning techniques, namely kNN, ANN, RF, Bagging, Boosting, and SVM. To compare the performance of the models, this study measured accuracy, specificity, sensitivity, and AUC, which are representative model evaluation measures. The ROC curve is created by plotting sensitivity against (1 - specificity). The AUC, the area under the ROC curve, summarizes sensitivity and specificity in a single value. This measure has two advantages over other evaluation measures. First, it is scale-invariant: it measures how well the model ranks its predictions rather than their absolute values. Second, it is classification-threshold-invariant: it does not depend on any single threshold, because the ROC curve plots sensitivity against (1 - specificity) across all thresholds. Because of these two advantages, we chose AUC as the final model evaluation measure. Variable selection was conducted using the Boruta algorithm, which suggested PO4-P, DO, BOD, NH3-N, Susp, pH, TOC, Temp, TN, and TP as significant explanatory variables. To identify the better model, we compared the models fitted with and without these selected variables. Among the models without variable selection, RF showed the highest accuracy and ANN showed the highest AUC. In conclusion, ANN with variable selection showed the best performance among all models, with or without variable selection.
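The two properties of AUC described above can be checked with a small sketch. It assumes binary labels coded as 0 and 1 and uses the trapezoidal rule for the area; the function names and the toy data are hypothetical and are not taken from the study.

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """Return (1 - specificity, sensitivity) pairs over all thresholds.

    Each distinct score is used once as a threshold, from highest to lowest,
    so the curve runs from (0, 0) to (1, 1).
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.sort(np.unique(scores))[::-1]
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [0.0], [0.0]
    for thr in thresholds:
        pred = scores >= thr
        tpr.append(np.sum(pred & (y_true == 1)) / n_pos)  # sensitivity
        fpr.append(np.sum(pred & (y_true == 0)) / n_neg)  # 1 - specificity
    return np.array(fpr), np.array(tpr)

def auc(y_true, scores):
    """Area under the ROC curve by the trapezoidal rule."""
    fpr, tpr = roc_curve_points(y_true, scores)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy data (hypothetical): two non-bloom cases (0) and two bloom cases (1).
labels = [0, 0, 1, 1]
scores = np.array([0.1, 0.4, 0.35, 0.8])

print(auc(labels, scores))             # 0.75
print(auc(labels, 10 * scores + 3.0))  # 0.75, rescaling leaves the ranking unchanged
```

Because the area is accumulated over every threshold, no single cut-off has to be chosen, and because only the ordering of the scores matters, any strictly increasing rescaling of the scores gives the same AUC.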