Machine Learning studies often involve a series of computational experiments in which the predictive performance of multiple models is compared across one or more datasets. The results are usually summarized through average statistics, either in numeric tables or simple plots. Such summaries fail to reveal interesting subtleties about algorithmic performance, including which observations an algorithm finds easy or hard to classify, and which observations within a dataset present unique challenges. Recently, a methodology known as Instance Space Analysis (ISA) was proposed for visualizing algorithm performance across different datasets. This methodology relates predictive performance to estimated instance hardness measures extracted from the datasets. However, that analysis treated an entire classification dataset as a single instance and reported algorithm performance for each dataset as an average error across all of its observations. In this paper, we develop a more fine-grained analysis by adapting the ISA methodology. The adapted version of ISA allows the analysis of an individual classification dataset through a 2-D hardness embedding, which provides a visualization of the data according to the difficulty level of its individual observations. This allows deeper analyses of the relationships between instance hardness and the predictive performance of classifiers. We also provide an open-access Python package named PyHard, which encapsulates the adapted ISA and provides an interactive visualization interface. We illustrate through case studies how our tool can provide insights about data quality and algorithm performance in the presence of challenges such as noisy and biased data.
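To make the notion of per-observation hardness concrete, the sketch below computes one well-known instance hardness measure, k-Disagreeing Neighbors (kDN): the fraction of an observation's k nearest neighbors that carry a different label. This is an illustration only. The function name, the toy data, and the choice of kDN are assumptions made here, and the code does not reproduce PyHard's interface or the full set of measures used by the adapted ISA.

```python
import math

def k_disagreeing_neighbors(X, y, k=3):
    """Return, for each observation, the fraction of its k nearest
    neighbors (Euclidean distance) whose label differs from its own."""
    n = len(X)
    scores = []
    for i in range(n):
        # Distances from observation i to every other observation.
        others = sorted(
            (math.dist(X[i], X[j]), j) for j in range(n) if j != i
        )
        neighbors = [j for _, j in others[:k]]
        disagree = sum(1 for j in neighbors if y[j] != y[i])
        scores.append(disagree / len(neighbors))
    return scores

# Hypothetical toy data: two well-separated clusters, plus one
# observation labeled 1 that sits inside the class-0 cluster.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),
     (0.05, 0.05)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

hardness = k_disagreeing_neighbors(X, y, k=3)
for point, label, h in zip(X, y, hardness):
    print(point, label, round(h, 2))
```

In this toy case the mislabeled-looking observation receives the highest score, which is the kind of per-observation signal that an average error over the whole dataset hides.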
The scientific community made many efforts during the Covid-19 pandemic to understand the disease and better manage health systems' resources. On the premise that city and population characteristics influence how the disease spreads and develops, we used Machine Learning techniques to provide insights that support decision-making in the city of São José dos Campos (SP), Brazil. Using a database with information on people who underwent Covid-19 testing in this city, we generated and evaluated predictive models of severity, need for hospitalization, and length of hospitalization. Additionally, we used SHAP values to interpret the models and identify the attributes that most influence the predictions. We conclude that the most important factors for predicting severity and need for hospitalization in this city are patient age combined with symptoms such as saturation and respiratory distress and comorbidities such as cardiovascular disease and diabetes. We also stress the need for greater attention to the proper collection of this information from citizens who undergo the Covid-19 diagnostic test.
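The idea behind SHAP-style attribution can be illustrated with exact Shapley values on a very small model. The sketch below is an assumption-laden toy: the risk function, its coefficients, the three binary features, and the all-zeros reference patient are all hypothetical and are not taken from the study. It also uses a single reference point to stand in for "absent" features, which is a simplification of what the SHAP library estimates.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, instance, baseline):
    """Exact Shapley values of each feature for one prediction.
    Features outside a coalition are set to their baseline value."""
    n = len(instance)

    def value(subset):
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return model(x)

    phi = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(rest, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

# Hypothetical risk score over [age_over_60, low_saturation, diabetes];
# the coefficients are invented for illustration only.
def toy_risk(x):
    age, sat, diab = x
    return 0.1 + 0.3 * age + 0.4 * sat + 0.1 * diab + 0.1 * age * sat

patient = [1, 1, 0]      # hypothetical patient
reference = [0, 0, 0]    # hypothetical reference patient

contributions = shapley_values(toy_risk, patient, reference)
for name, c in zip(["age_over_60", "low_saturation", "diabetes"], contributions):
    print(name, round(c, 3))
```

The contributions sum to the difference between the patient's score and the reference score, and the interaction term is split evenly between the two features that produce it. Ranking attributes by such contributions is the sense in which an attribute is called decisive for a prediction.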