Objective
Predicting daily trends in Coronavirus Disease 2019 (COVID-19) case numbers is important to support individual decisions in taking preventative measures. This study aims to use COVID-19 case number history, demographic characteristics, and social distancing policies, both independently and interdependently, to predict the daily rise or fall of county-level cases.
Materials and Methods
We extracted 2,093 features for 3,142 United States counties: 5 from the U.S. COVID-19 case number history, 1,824 from the demographic characteristics (independently and interdependently), and 264 from the social distancing policies (independently and interdependently). Using the top 200 selected features, we built four machine learning models (Logistic Regression, Naïve Bayes, Multi-Layer Perceptron, and Random Forest) along with four Ensemble methods (Average, Product, Minimum, and Maximum) and compared their performances.
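The four Ensemble methods named above correspond to standard fixed rules for combining classifier outputs. The following is a minimal sketch of those rules, not the study's implementation (which is available at the DOI given in the Conclusion). It assumes each base model outputs class probabilities and that the combined scores are renormalized to sum to one; the function name `combine_probabilities` and the example values are hypothetical.

```python
import numpy as np

# Fixed combination rules, each applied across the model axis.
RULES = {
    "average": np.mean,
    "product": np.prod,
    "minimum": np.min,
    "maximum": np.max,
}

def combine_probabilities(probs, rule="average"):
    """Combine class probabilities from several models with a fixed rule.

    probs: array-like of shape (n_models, n_samples, n_classes).
    Returns an array of shape (n_samples, n_classes) whose rows sum to 1.
    """
    if rule not in RULES:
        raise ValueError(f"unknown rule: {rule!r}")
    probs = np.asarray(probs, dtype=float)
    combined = RULES[rule](probs, axis=0)
    # Renormalize so each row is again a probability distribution
    # (an assumption of this sketch; assumes no row is all zeros).
    return combined / combined.sum(axis=1, keepdims=True)

# Hypothetical outputs of two models for one county-day,
# ordered as [P(fall), P(rise)].
example = [
    [[0.2, 0.8]],
    [[0.4, 0.6]],
]

for name in RULES:
    p_rise = combine_probabilities(example, name)[0, 1]
    print(f"{name}: P(rise) = {p_rise:.3f}")
```

With only two classes, the rules differ mainly in how strongly a single confident or dissenting model moves the combined score: the product and minimum rules are sensitive to any one model assigning a low probability, whereas the average rule smooths over individual models.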
Results
The Ensemble Average method had the highest area under the receiver operating characteristic curve (AUC), at 0.692. The top-ranked features were all interdependent features. Our feature analysis showed demographics (e.g., white alone males, black alone females, and higher age groups) and social distancing policies (e.g., quarantine rules, gathering sizes, and declaration of emergency) to be the most impactful.
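For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen positive example (here, a rise in cases) receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy labels and scores are hypothetical and are not drawn from the study's data.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes must be present")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (pos.size * neg.size)

# Toy example: 1 = rise, 0 = fall.
labels = [0, 0, 1, 1]
predicted = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, predicted))  # 3 of 4 pairs ranked correctly
```

Under this reading, an AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The pairwise comparison uses memory proportional to the number of positive-negative pairs, so it suits illustration rather than large datasets.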
Conclusion
The findings of this study suggest that diverse features, especially when combined, have predictive power for county-level trends in COVID-19 cases, and these findings can help individuals make their daily decisions. Our results may guide future studies to consider more features interdependently, drawn from conventionally distinct data sources, in county-level predictive models. Our code is available at: https://doi.org/10.5281/zenodo.6332944.
Lay Summary
Predicting the daily trend of Coronavirus Disease 2019 (COVID-19) is important to support individual decisions in taking preventative measures. This study aims to use COVID-19 case number history, population demographic characteristics, and social distancing policies to predict the rise or fall of county-level cases in the United States. A unique aspect of the study is that it combines predictors from data sources that are not conventionally used together. Using the top 200 features selected from 2,093 features for 3,142 United States counties, we built four machine learning models, along with four ensemble methods, and compared their performances. We achieved reasonable prediction and calibration results across all constructed models, with negligible runtimes. Our feature analysis showed the most impactful predictors to be features derived from combining independent ones. The findings of this study suggest that diverse features are important in predicting county-level trends of COVID-19 cases within the United States when they are combined across traditionally distinct domains. Our results may guide future studies to consider more diverse features in predictive models.