Data preprocessing remains an important step in machine learning studies. Proper preprocessing of imbalanced data enables researchers to reduce defects as far as possible, which may in turn lead to the elimination of defects in existing data sets. Despite the remarkable achievements in machine learning research, systematic literature reviews of imbalanced data preprocessing techniques remain scarce. In this study, the authors assess the existing literature to identify the key issues related to data quality and handling and to provide a convenient collection of the techniques used to address these issues during data preprocessing. They applied a systematic literature review method involving a manual search to select articles published from January 2010 to September 2018. The quality of the existing studies was assessed against a set of quality assessment criteria. Of the 118 relevant studies found, only 2% had been conducted following systematic literature review guidelines. This study therefore calls for more systematic literature reviews on data preprocessing to improve the quality of the data applied in machine learning studies.
Predicting the number of defects in software at the method level is important. However, little or no research has focused on method-level defect prediction, and considerable effort is still required to demonstrate how it can be achieved for a new software version. In the current study, we analyze information obtained from the current version of a software product and construct regression models that estimate the number of defects in a new version. The models use three variables, defect density, defect velocity and defect introduction time, which show considerable correlation with the number of method-level defects. These variables also reveal a mathematical relationship between defect density and defect acceleration at the method level, further indicating that the increase in the number of defects and the defect density are functions of the defect acceleration. We report an experiment conducted on the Finding Faults Using Ensemble Learners (ELFF) open-source Java projects, which contain 289,132 methods. The results show correlation coefficients of 60% for the defect density, -4% for the defect introduction time, and 93% for the defect velocity. These findings indicate that the average defect velocity correlates strongly with the number of defects at the method level. The proposed approach also motivates an investigation and comparison of the average performance of classifiers before and after method-level data preprocessing and of the level of entropy in the datasets.
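As an illustration of the kind of analysis described above, the following minimal sketch computes a Pearson correlation coefficient and fits a one-variable least-squares line. It is not the authors' implementation: the abstract does not define how defect velocity is measured or which regression form was used, so the function names `pearson` and `fit_line` and the sample values are hypothetical and serve only to show the mechanics.

```python
from math import sqrt


def _centered_sums(xs, ys):
    """Return (mean_x, mean_y, Sxy, Sxx, Syy) for two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return mx, my, sxy, sxx, syy


def pearson(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    _, _, sxy, sxx, syy = _centered_sums(xs, ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    mx, my, sxy, sxx, _ = _centered_sums(xs, ys)
    if sxx == 0:
        raise ValueError("cannot fit a line to a constant predictor")
    slope = sxy / sxx
    return slope, my - slope * mx


if __name__ == "__main__":
    # Made-up illustrative values, NOT data from the ELFF projects.
    defect_velocity = [0.1, 0.4, 0.5, 0.9, 1.2]
    defect_count = [1, 3, 4, 8, 10]
    r = pearson(defect_velocity, defect_count)
    slope, intercept = fit_line(defect_velocity, defect_count)
    print(f"r = {r:.3f}, defects ~ {slope:.2f} * velocity + {intercept:.2f}")
```

A coefficient reported as 93% corresponds to r = 0.93 on this scale; the fitted line can then be applied to values measured on the current version to estimate counts for the next one, under whatever variable definitions the study actually uses.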
Background: Uncertainties surrounding the 2019 novel coronavirus (COVID-19) remain a major global health challenge and require attention. Researchers and medical experts have made remarkable efforts to reduce the number of cases and prevent future outbreaks through vaccines and other measures. However, there is little evidence on how severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection entropy can be applied in predicting the possible number of infections and deaths. In addition, more studies are needed on how the COVID-19 infection density contributes to the rise in infections. This study demonstrates how the SARS-CoV-2 daily infection entropy can be applied in predicting the number of infections within a given period. It also shows that the infection density within a given population contributes to an increase in the number of COVID-19 cases and, consequently, to new variants. Results: Using the initial COVID-19 data reported by Johns Hopkins University, the World Health Organization (WHO) and the Global Initiative on Sharing All Influenza Data (GISAID), the results show that the original SARS-CoV-2 strain has R0<1 with an initial infection growth rate entropy of 9.11 bits for the United States (U.S.). At close proximity, the average time for an infected individual to infect others within a susceptible population is approximately 7 minutes. Assuming no vaccines were available, the number of infections in the U.S. could range between 41,220,199 and 82,440,398 in late March 2022, with approximately 1,211,036 deaths. However, with the available vaccines, nearly 48 million COVID-19 cases and 706,437 deaths have been prevented. Conclusion: The proposed technique will contribute to the ongoing investigation of the COVID-19 pandemic and offers a blueprint for addressing the uncertainties surrounding it.
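The abstract reports entropy in bits but does not state how the infection entropy is computed. As a rough sketch only, the code below assumes the standard Shannon entropy applied to the share of total infections that falls on each day. The function name `shannon_entropy_bits` and the sample counts are hypothetical, and the sketch is not expected to reproduce the 9.11-bit figure.

```python
from math import log2


def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution implied by the counts.

    Each count is treated as the number of events (for example, new
    infections) in one bin (for example, one day). Bins with zero events
    contribute nothing to the sum.
    """
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one count must be positive")
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


if __name__ == "__main__":
    # Made-up daily case counts for illustration, NOT reported data.
    daily_cases = [12, 30, 45, 80, 150, 210, 400]
    print(f"{shannon_entropy_bits(daily_cases):.2f} bits")
```

Under this assumption, entropy is highest when cases are spread evenly across days (log2 of the number of days) and falls toward zero as cases concentrate on a single day.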