2022
DOI: 10.2196/preprints.38590
Preprint

Dealing with the Missing, Imbalanced and Sparse Features Problems in Emergency Data Using Random Forest, K-means and PCA Respectively (Preprint)

Abstract: BACKGROUND In emergency departments (ED), timely rescue is very important as patients’ conditions usually deteriorate rapidly. Early diagnosis can increase patients’ chances of survival. Early diagnosis can be improved by predictive models based on machine learning using Electronic Medical Record (EMR) data. However, ED data are usually imbalanced, having missing values and sparse features. These quality issues make it challenging to build early identification models for diseases in ED.…

Cited by 33 publications
(43 citation statements)
References 33 publications
“…In other words, the majority class (i.e., no MHW) has been down-sampled to balance out the categories. Similar techniques have been previously reported to yield higher efficiency in classifying imbalanced datasets by tree classifiers (Drummond & Holte, 2003; Kubat & Matwin, 1997) as well as random forest algorithms (Chen et al, 2004). Preliminary analyses with no balancing showed spurious extreme high accuracies due to model overfitting, that mainly identifies MHW absence throughout the whole northeast Pacific.…”
Section: Data
Citation type: supporting
confidence: 67%
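
The down-sampling described in the statement above can be illustrated with a minimal sketch. This is not the citing authors' code: the function name `downsample_majority`, the toy labels, and the fixed seed are all assumptions made for illustration, and it uses plain random under-sampling to the size of the smallest class.

```python
import random
from collections import defaultdict


def downsample_majority(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until every class
    has as many rows as the smallest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    # The smallest class sets the per-class sample size.
    target = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        # Sample without replacement from each class.
        balanced.extend(rng.sample(group, target))

    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 90 "no_event" rows and 10 "event" rows.
data = [("no_event", i) for i in range(90)] + [("event", i) for i in range(10)]
balanced = downsample_majority(data, label_of=lambda row: row[0])
```

With the toy data, each class ends up with 10 rows. Under-sampling discards majority-class information, so it is usually applied to the training split only.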
“…For the classification RFs, we used the presence of an event (critical phase or COVID‐19‐related death) within 7 days of diagnosis as the outcome of interest during Boruta selection. We used the balanced method by Chen et al 11 both during Boruta selection and modeling with the selected variables. We used survival random forest (RSF) as described by Ishwaran et al, 12 during Boruta selection, and during the final modeling of time‐to‐event data.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…7, is the decision tree model, is the classification result of the decision tree, and is the index function. Since there was a class imbalance problem in the dataset, this study improved the impact of the class imbalance problem on model construction by introducing a sample weight parameter in the RF (Chen, 2004), as shown in Eq. 8.…”
Section: Methods
Citation type: mentioning
confidence: 99%