2023
DOI: 10.1371/journal.pone.0284150
Extracting relevant predictive variables for COVID-19 severity prognosis: An exhaustive comparison of feature selection techniques

Abstract: With the COVID-19 pandemic having caused unprecedented numbers of infections and deaths, large research efforts have been undertaken to increase our understanding of the disease and the factors which determine diverse clinical evolutions. Here we focused on a fully data-driven exploration regarding which factors (clinical or otherwise) were most informative for SARS-CoV-2 pneumonia severity prediction via machine learning (ML). In particular, feature selection techniques (FS), designed to reduce the dimensiona…

Cited by 7 publications (6 citation statements) · References 88 publications (141 reference statements)
“…The genetic algorithm has proven effective in several contexts, including the work of Kabir et al. [20], which introduced a redundancy reduction approach. In our study and in that of Hayet et al. [29], the genetic algorithms provided remarkable results, although the choice of appropriate objective functions could have improved the performance. Focusing on the detection of COVID-19, the study by Hayet et al. [29] highlights the importance of specific variables, such as CRP, Respiratory Rate, Oxygen Saturation, and LDH.…”
Section: Variable Reduction (supporting)
confidence: 58%
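The genetic-algorithm approach this excerpt refers to can be sketched as a minimal wrapper: candidate feature subsets are bit masks, and an objective function scores each mask. The feature names, relevance scores, and toy fitness function below are purely illustrative assumptions (a real wrapper would score each mask by training a classifier on the selected columns), not values from the cited papers.

```python
# Minimal sketch of GA-based feature selection; all names and scores
# below are illustrative stand-ins, not data from the cited studies.
import random

random.seed(0)

FEATURES = ["CRP", "RespRate", "SpO2", "LDH", "Age", "Noise1", "Noise2", "Noise3"]
# Hypothetical per-feature relevance; a real wrapper would instead train a
# classifier on each candidate subset and use its validation score.
RELEVANCE = [0.9, 0.8, 0.85, 0.7, 0.4, 0.05, 0.02, 0.03]

def fitness(mask):
    # Reward relevant features, penalise subset size (parsimony pressure).
    gain = sum(r for r, m in zip(RELEVANCE, mask) if m)
    return gain - 0.1 * sum(mask)

def evolve(pop_size=20, generations=40, p_mut=0.1):
    n = len(FEATURES)
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n)          # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [1 - g if random.random() < p_mut else g for g in child]
            children.append(child)
        pop = survivors + children
    best = max(pop, key=fitness)
    return [f for f, m in zip(FEATURES, best) if m]

selected = evolve()
```

The excerpt's caveat about objective functions maps directly onto `fitness`: swapping in a different score (e.g. cross-validated AUC minus a size penalty) changes which subsets the GA favours.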
“…In this work, we sought to use diverse algorithms inspired by evolution, physical mechanisms and collective intelligence, as mentioned in the work of Agrawal et al. [28] It should be noted that the algorithms applied were quite simple, trying to preserve the minimal version of each one to encourage equality of conditions, so they are not directly comparable with those of other works that present cumulative improvements to said algorithms. [20,29,21,22,23] Relevant differences with said works will be discussed below.…”
Section: Variable Reduction (mentioning)
confidence: 99%
“…Table 3 covers the main clinical characterization of our clustering results per explanatory variable, focusing on the 18/92 variables with statistically large inter-phenotype effect sizes. This subset of 18 key variables can also be viewed as a data-driven selection of the most informative factors for predicting the clinical outcomes under study (severity, mortality) [24,29]. In particular, phenotype C (low prevalence, 9.0%) included older patients, with more comorbidities, worse respiratory status (peripheral oxygenation, as well as in the arterial blood gas tests), and more unfavourable inflammatory, renal and/or hematologic biomarkers (C-reactive protein, procalcitonin, D-dimer, neutrophil-to-lymphocyte ratio, creatinine, BUN, prothrombin, etc.)…”
Section: Discussion (mentioning)
confidence: 99%
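One common way to quantify a "statistically large inter-phenotype effect size" is Cohen's d with a pooled standard deviation; the excerpt does not name the exact statistic used, so the following is only a hedged sketch with toy C-reactive protein values, not the study's data or its actual threshold.

```python
# Sketch: screening explanatory variables by effect size between one
# phenotype and the rest. Cohen's d is an assumption here, as is the
# |d| >= 0.8 "large" convention; the CRP values are toy data.
from statistics import mean, variance

def cohens_d(group_a, group_b):
    # Standardised mean difference with pooled sample variance.
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Illustrative CRP values (mg/L): phenotype C vs the remaining clusters.
crp_phenotype_c = [110, 140, 95, 160, 120]
crp_others = [20, 35, 15, 40, 25, 30]

d = cohens_d(crp_phenotype_c, crp_others)
is_large = abs(d) >= 0.8  # conventional "large effect" cut-off
```

Repeating such a screen per variable and keeping those above the threshold yields a shortlist analogous to the 18 key variables described above.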
“…From the demographic and clinical information collected at the baseline time of hospitalization, 92 explanatory variables met our criterion of <60% missingness, whereas another 14 variables (e.g. ferritin, bilirubin, albumin, troponin, interleukin-6 (IL-6), aspartate aminotransferase (AST), creatine phosphokinase (CPK), platelets or eosinophils) failed to meet this data quality criterion (see [24] for further details). Once categorical variables were transformed via one-hot encoding, these 92 attributes became d=109 features.…”
Section: Cohort (mentioning)
confidence: 99%
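The two preprocessing steps this excerpt describes, dropping variables at or above 60% missingness and then one-hot encoding the categorical ones, can be sketched as follows. The toy records and variable names are illustrative assumptions, not the actual cohort data.

```python
# Sketch of the preprocessing pipeline: missingness filter, then one-hot
# encoding. Records and variable names are toy examples only.
records = [
    {"age": 67, "sex": "M", "ward": "ICU", "ferritin": None},
    {"age": 54, "sex": "F", "ward": "gen", "ferritin": None},
    {"age": 71, "sex": "M", "ward": None,  "ferritin": 820},
    {"age": 49, "sex": "F", "ward": "gen", "ferritin": None},
]

def keep_variable(name, rows, max_missing=0.60):
    # Keep only variables with <60% missing values, mirroring the criterion.
    missing = sum(r[name] is None for r in rows) / len(rows)
    return missing < max_missing

kept = [v for v in records[0] if keep_variable(v, records)]
# "ferritin" is 75% missing in this toy cohort, so it is dropped.

def one_hot(rows, variables):
    # Expand each categorical value into an indicator feature "var=value";
    # numeric values pass through unchanged, missing values are skipped.
    encoded = []
    for r in rows:
        row = {}
        for v in variables:
            val = r[v]
            if isinstance(val, str):
                row[f"{v}={val}"] = 1
            elif val is not None:
                row[v] = val
        encoded.append(row)
    return encoded

features = one_hot(records, kept)
```

This is how 92 kept attributes can grow into d=109 features: each categorical variable with k levels contributes k indicator columns instead of one.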
“…This process removes irrelevant and redundant features while keeping the most discriminative ones, allowing for training predictive models with higher discrimination power and better performance. Feature selection is evaluated by measuring the quality of the classification obtained with the selected subset of features (Hayet-Otero et al., 2023; Dabba et al., 2021; Bommert et al., 2020; Liu et al., 2002). Like feature importance, feature selection is also a model-dependent approach.…”
(mentioning)
confidence: 99%
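The wrapper-style evaluation this excerpt describes, scoring a candidate feature subset by the classification quality it yields, can be sketched with a deliberately simple nearest-centroid classifier. The classifier choice and the toy severity data are illustrative assumptions, not the models or data used in the cited works.

```python
# Sketch of wrapper evaluation: a feature subset is scored by the
# held-out accuracy of a classifier trained on just those features.
# Nearest-centroid is a stand-in; the data below are toy values.

def centroid(rows):
    # Component-wise mean of a list of equal-length vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]

def evaluate_subset(X, y, subset, X_test, y_test):
    project = lambda row: [row[j] for j in subset]
    cents = {label: centroid([project(x) for x, l in zip(X, y) if l == label])
             for label in set(y)}

    def predict(x):
        px = project(x)
        # Assign the class whose centroid is nearest (squared Euclidean).
        return min(cents, key=lambda l: sum((a - b) ** 2
                                            for a, b in zip(px, cents[l])))

    hits = sum(predict(x) == t for x, t in zip(X_test, y_test))
    return hits / len(X_test)

# Toy data: feature 0 separates the classes, feature 1 is pure noise.
X_train = [[0.1, 5.0], [0.2, 1.0], [0.9, 4.0], [1.1, 0.5]]
y_train = ["mild", "mild", "severe", "severe"]
X_test = [[0.0, 2.0], [1.0, 3.0]]
y_test = ["mild", "severe"]

acc_informative = evaluate_subset(X_train, y_train, [0], X_test, y_test)
acc_noise = evaluate_subset(X_train, y_train, [1], X_test, y_test)
```

Because the score depends on the classifier that consumes the subset, the same subset can rank differently under different models, which is the model-dependence the excerpt points out.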