Xiaojie Chen scite author profile

Xiaojie Chen

2 Publications

60 Citation Statements Received

80 Citation Statements Given

Publications

Dealing with the Missing, Imbalanced and Sparse Features Problems in Emergency Data Using Random Forest, K-means and PCA Respectively (Preprint)

Chen¹, Chen², Nan³ et al. 2022

Preprint

BACKGROUND: In emergency departments (EDs), timely rescue is very important because patients' conditions usually deteriorate rapidly. Early diagnosis can increase patients' chances of survival, and it can be improved by machine learning predictive models built on Electronic Medical Record (EMR) data. However, ED data are usually imbalanced and contain missing values and sparse features. These quality issues make it challenging to build early identification models for diseases in the ED.

OBJECTIVE: The objective of this study is to propose a systematic approach to deal with the missing, imbalanced, and sparse feature problems of ED data.

METHODS: We used a random forest to interpolate missing values and the K-means algorithm to under-sample the data. For sparse features, we used principal component analysis to reduce dimensions. Interpolation performance was evaluated with the coefficient of determination (R²) for continuous variables and the Kappa coefficient for discrete variables. The area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPRC) were used to estimate model performance. To further evaluate the proposed approach, we carried out a case study using an ED dataset extracted from Hainan Hospital of Chinese PLA General Hospital. A logistic regression model for predicting the worsening of a patient's condition was built from the data processed by the proposed approach.

RESULTS: A total of 1085 patients with a rescue record and 17,959 patients without a rescue record were collected, so the two groups were significantly imbalanced. We extracted 275, 402, and 891 variables from laboratory tests, medications, and diagnoses, respectively. After data preprocessing, the median R² of random forest interpolation for continuous variables was 0.623 (IQR: 0.647), and the median Kappa coefficient for discrete variable interpolation was 0.444 (IQR: 0.285). The logistic regression model constructed using the initial diagnostic data showed poor performance and variable separation, which was reflected in the abnormally high OR values of the two variables cardiac arrest and respiratory arrest (27857.4 and 9341.6) and an abnormal confidence interval. Using the processed data, the recall of the model reached 0.77, the F1-score was 0.74, and the AUC was 0.64.

CONCLUSIONS: We proposed a machine learning method to deal with data quality issues such as missing data, data imbalance, and sparse features in emergency data, so as to improve data availability. A preliminary case study indicates that the results produced by the proposed method can be used to build prediction models for emergency patients.
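The abstract does not say how K-means and PCA were configured, so the following is only a minimal sketch of the under-sampling and dimension-reduction steps, not the authors' implementation. It assumes one common scheme: cluster the majority class into as many clusters as there are minority samples and keep the real sample nearest each centroid. The function names, the synthetic data, and the choice of 3 components are all hypothetical, and the random forest interpolation step is omitted.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                new[j] = members.mean(axis=0)
        converged = np.allclose(new, centroids)
        centroids = new
        if converged:
            break
    # Final assignment against the final centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)

def undersample_majority(X_major, n_keep, seed=0):
    """Assumed scheme: keep the real sample closest to each cluster centroid."""
    centroids, labels = kmeans(X_major, n_keep, seed=seed)
    keep = []
    for j in range(n_keep):
        idx = np.flatnonzero(labels == j)
        if len(idx) == 0:  # an empty cluster contributes no sample
            continue
        d2 = ((X_major[idx] - centroids[j]) ** 2).sum(axis=1)
        keep.append(idx[d2.argmin()])
    return X_major[np.array(keep)]

def pca_reduce(X, n_components):
    """Project centered data onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Synthetic stand-in for an imbalanced data set (not the study's data).
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(300, 10))  # e.g. no rescue record
X_minor = rng.normal(1.0, 1.0, size=(20, 10))   # e.g. rescue record

X_major_small = undersample_majority(X_major, n_keep=len(X_minor))
X_bal = np.vstack([X_major_small, X_minor])
y_bal = np.r_[np.zeros(len(X_major_small)), np.ones(len(X_minor))]
Z = pca_reduce(X_bal, n_components=3)
```

Keeping real samples rather than the centroids themselves means every retained row is an actual record. Using the centroids as synthetic representatives is an equally plausible reading of "K-means under-sampling".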

Dealing With Missing, Imbalanced, and Sparse Features During the Development of a Prediction Model for Sudden Death Using Emergency Medicine Data: Machine Learning Approach

Chen¹, Chen², Nan³ et al. 2023

JMIR Med Inform

Background: In emergency departments (EDs), early diagnosis and timely rescue, which are supported by prediction models using ED data, can increase patients' chances of survival. Unfortunately, ED data usually contain missing, imbalanced, and sparse features, which makes it challenging to build early identification models for diseases.

Objective: This study aims to propose a systematic approach to deal with the problems of missing, imbalanced, and sparse features when developing sudden-death prediction models using emergency medicine (or ED) data.

Methods: We proposed a 3-step approach to deal with data quality issues: a random forest (RF) for missing values, k-means for imbalanced data, and principal component analysis (PCA) for sparse features. Interpolation performance was evaluated with the coefficient of determination (R²) for continuous variables and the κ coefficient for discrete variables. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) were used to estimate the model's performance. To further evaluate the proposed approach, we carried out a case study using an ED data set obtained from the Hainan Hospital of Chinese PLA General Hospital. A logistic regression (LR) model for predicting the worsening of a patient's condition was built.

Results: A total of 1085 patients with rescue records and 17,959 patients without rescue records were selected, so the two groups were significantly imbalanced. We extracted 275, 402, and 891 variables from laboratory tests, medications, and diagnoses, respectively. After data preprocessing, the median R² of the RF interpolation for continuous variables was 0.623 (IQR 0.647), and the median κ coefficient for discrete variable interpolation was 0.444 (IQR 0.285). The LR model constructed using the initial diagnostic data showed poor performance and variable separation, which was reflected in the abnormally high odds ratio (OR) values of the 2 variables cardiac arrest and respiratory arrest (201568034532 and 1211118945, respectively) and an abnormal 95% CI. Using the processed data, the recall of the model reached 0.746, the F1-score was 0.73, and the AUROC was 0.708.

Conclusions: The proposed systematic approach is valid for building a prediction model for emergency patients.
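The two interpolation metrics named in the abstract can be illustrated with a small sketch. It assumes the usual evaluation setup, in which known values are hidden, interpolated, and then compared with the originals; the abstract does not state the exact protocol. The function names and toy values are hypothetical and are not data from the study.

```python
from collections import Counter

def r_squared(y_true, y_pred):
    """Coefficient of determination for interpolated continuous values."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the marginal label frequencies.
    p_expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy continuous variable: held-out true values vs. interpolated values.
true_cont = [1.0, 2.0, 3.0, 4.0]
imputed_cont = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(true_cont, imputed_cont)

# Toy discrete variable: held-out true labels vs. interpolated labels.
true_disc = [0, 0, 1, 1, 1, 0]
imputed_disc = [0, 1, 1, 1, 0, 0]
kappa = cohen_kappa(true_disc, imputed_disc)
```

Both metrics equal 1 for perfect interpolation. κ equals 0 when agreement is no better than chance, which makes it stricter than raw accuracy for skewed discrete variables.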
