Introduction
Predicting the survival status of patients from their prognosis is important because it can assist physicians in evaluating treatment decisions. Random Forest performs well as a machine learning algorithm even without modification. We propose a new Random Forest weighting method, apply it to gastric cancer patient data from the Surveillance, Epidemiology, and End Results (SEER) program, and then evaluate the generalization ability of the weighted Random Forest algorithm on 10 public medical datasets. Furthermore, for the same weighting mode, we explore the difference between using out-of-bag (OOB) data and using the full training set as the weighting basis.

Material and methods
The experiment included 110,697 cases of gastric cancer diagnosed between 1975 and 2016, obtained from the SEER database. In addition, 10 public medical datasets were used to evaluate the generalization ability of the weighted Random Forest algorithm.

Results
On the SEER gastric cancer patient data, the weighted Random Forest algorithm improves accuracy by 0.79% compared with the original Random Forest. In AUC, the macro-average increased by 2.32% and the micro-average increased by 0.51% on average. Among the 10 public datasets, the Random Forest weighted by accuracy performs best on 6 datasets, with an average increase of 1.44% in accuracy and an average increase of 1.2% in AUC.

Conclusions
Compared with the original Random Forest, the weighted Random Forest model shows a significant improvement in performance, and using all training data as the weighting basis works better than using OOB data.
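The abstract does not give the weighting formula, so the following is only a minimal sketch of one common form of weighted Random Forest voting, under the assumption that each tree's vote is weighted by its accuracy on some weighting basis. The function names (`accuracy`, `tree_weights`, `weighted_vote`) and the example labels are hypothetical and are not taken from the paper. The weighting basis is passed in per tree, so it can be either each tree's own OOB samples or the same full training set for every tree.

```python
from collections import defaultdict


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if not labels:
        return 0.0
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def tree_weights(per_tree_predictions, per_tree_labels):
    """One weight per tree: its accuracy on its weighting basis.

    Assumption: for an OOB basis, entry i holds tree i's predictions and the
    true labels on tree i's out-of-bag samples; for a full-training-set basis,
    every entry of per_tree_labels is the same list of training labels.
    """
    return [
        accuracy(preds, labels)
        for preds, labels in zip(per_tree_predictions, per_tree_labels)
    ]


def weighted_vote(votes, weights):
    """Return the class label with the largest total weight.

    votes[i] is the label predicted by tree i for a single sample.
    """
    totals = defaultdict(float)
    for label, weight in zip(votes, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical example: one accurate tree outvotes two weaker trees.
votes = ["dead", "alive", "alive"]
print(weighted_vote(votes, [0.9, 0.4, 0.4]))  # -> dead
print(weighted_vote(votes, [1.0, 1.0, 1.0]))  # -> alive (plain majority vote)
```

With equal weights the function reduces to the ordinary majority vote of an unweighted Random Forest, which makes the two variants easy to compare on the same predictions.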
The unprecedented coronavirus disease 2019 (COVID-19) pandemic is still raging (as of 2021) in many countries worldwide. Various response strategies for studying the characteristics and distribution of the virus in different regions of the world have been developed to assist in the prevention and control of this epidemic. In this study, descriptive statistics and regression analysis were applied to COVID-19 data from different countries to compare and evaluate various regression models. The results showed that the extreme random forest regression (ERFR) model performed best, and that population density, ozone, median age, life expectancy, and the Human Development Index (HDI) were relatively influential factors in the spread and diffusion of COVID-19 in the ERFR model. In addition, the clustering characteristics of the epidemic were analyzed with the spectral clustering algorithm. The visualization of the spectral clustering results showed that the geographical distribution of the global spread of COVID-19 was highly clustered, and that its clustering characteristics and influencing factors were also somewhat consistent in their distribution. This study aims to deepen the international community's understanding of the global COVID-19 pandemic, so that countries worldwide can develop measures to mitigate potential large-scale outbreaks and improve their ability to respond to such public health emergencies.