Strategies for Handling Missing Data in Electronic Health Record Derived Data

Wells, Brian J.; Chagin, Kevin; Nowacki, Amy S.; Kattan, Michael W.

doi:10.13063/2327-9214.1035

Cited by 268 publications

(238 citation statements)

References 28 publications

Supporting

Mentioning

206

Contrasting

Unclassified


“…Unfortunately, in contrast to confounding bias,11–21 the control of selection bias in EHR-based settings has received virtually no attention in the literature. This may be due, in part, to the notion that selection bias can be cast as a missing data problem and that statistical methods for missing data are well established22,23 and can be readily applied to EHR-based CER 24…”

Section: Introduction (mentioning)

confidence: 99%

A General Framework for Considering Selection Bias in EHR-Based Studies: What Data are Observed and Why?

Haneuse

Daniels

2016

eGEMs


Electronic health records (EHR) data are increasingly seen as a resource for cost-effective comparative effectiveness research (CER). Since EHR data are collected primarily for clinical and/or billing purposes, their use for CER requires consideration of numerous methodologic challenges including the potential for confounding bias, due to a lack of randomization, and for selection bias, due to missing data. In contrast to the recent literature on confounding bias in EHR-based CER, virtually no attention has been paid to selection bias, possibly due to the belief that standard methods for missing data can be readily applied. Such methods, however, hinge on an overly simplistic view of the available/missing EHR data, so that their application in the EHR setting will often fail to completely control selection bias. Motivated by challenges we face in an ongoing EHR-based comparative effectiveness study of choice of antidepressant treatment and long-term weight change, we propose a new general framework for selection bias in EHR-based CER. Crucially, the framework provides structure within which researchers can consider the complex interplay between numerous decisions, made by patients and health care providers, which give rise to health-related information being recorded in the EHR system, as well as the wide variability across EHR systems themselves. This, in turn, provides structure within which: (i) the transparency of assumptions regarding missing data can be enhanced, (ii) factors relevant to each decision can be elicited, and (iii) statistical methods can be better aligned with the complexity of the data.



“…We compared imputation using the CMM to population mean imputation (as a baseline), multivariate imputation using chained equations (MICE) [6,38], and k-nearest neighbors imputation. For our purposes, we set the prediction method of MICE to predictive mean matching [6] for non-categorical variables and logistic/polytomous regression for categorical variables.…”

Section: Methods (mentioning)

confidence: 99%

Flexible, cluster-based analysis of the electronic medical record of sepsis with composite mixture models

Mayhew

Petersen

Sales

et al. 2018

Journal of Biomedical Informatics


The widespread adoption of electronic medical records (EMRs) in healthcare has provided vast new amounts of data for statistical machine learning researchers in their efforts to model and predict patient health status, potentially enabling novel advances in treatment. In the case of sepsis, a debilitating, dysregulated host response to infection, extracting subtle, uncataloged clinical phenotypes from the EMR with statistical machine learning methods has the potential to impact patient diagnosis and treatment early in the course of their hospitalization. However, there are significant barriers that must be overcome to extract these insights from EMR data. First, EMR datasets consist of both static and dynamic observations of discrete and continuous-valued variables, many of which may be missing, precluding the application of standard multivariate analysis techniques. Second, clinical populations observed via EMRs and relevant to the study and management of conditions like sepsis are often heterogeneous; properly accounting for this heterogeneity is critical. Here, we describe an unsupervised, probabilistic framework called a composite mixture model that can simultaneously accommodate the wide variety of observations frequently observed in EMR datasets, characterize heterogeneous clinical populations, and handle missing observations. We demonstrate the efficacy of our approach on a large-scale sepsis cohort, developing novel techniques built on our model-based clusters to track patient mortality risk over time and identify physiological trends and distinct subgroups of the dataset associated with elevated risk of mortality during hospitalization.
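For readers unfamiliar with two of the baselines named in the citation statement for this entry (population mean imputation and k-nearest neighbors imputation), the following is a minimal sketch in plain Python. It is not the cited authors' implementation: the function names, the toy data, and the distance rule (root mean squared difference over the columns two rows share) are assumptions made for illustration, and MICE is not shown.

```python
import math


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def _distance(a, b):
    """Root mean squared difference over columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no shared columns, so the rows cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=2):
    """Replace each None with the column mean over the k nearest donor rows.

    A donor is another row that observes the missing column and shares at
    least one observed column with the target row. Cells with no donor are
    left as None.
    """
    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [(_distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] < math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:
                out[i][j] = sum(v for _, v in nearest) / len(nearest)
    return out


# Hypothetical toy data: two numeric columns, one missing cell.
data = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, None],
    [10.0, 100.0],
]
mean_filled = mean_impute(data)
knn_filled = knn_impute(data, k=2)
```

On this toy input the mean imputer is pulled toward the outlying fourth row, while the nearest-neighbor imputer borrows only from the two rows closest to the incomplete one.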


“…Also, the missing pattern of time series data may also contain information that could improve the performance of model prediction. The other option is to fix the missing values by resampling or interpolation, but these methods may require knowledge of the whole dataset before dealing with missing data, and may result in a two-staged modelling process (Wells et al, 2013). Recent works tried to model explicitly the missingness of various datasets (Wu et al, 2015), or interpolate according to the time series information of missing data in health care dataset (Che et al, 2016).…”

Section: Fixing Missing Values (mentioning)

confidence: 99%

A Spatiotemporal Prediction Framework for Air Pollution Based on Deep RNN

Fan

Hou

et al. 2017

ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci.

128


ABSTRACT: Time series data in practical applications always contain missing values due to sensor malfunction, network failure, outliers etc. In order to handle missing values in time series, as well as the lack of considering temporal properties in machine learning models, we propose a spatiotemporal prediction framework based on missing value processing algorithms and deep recurrent neural network (DRNN). By using missing tag and missing interval to represent time series patterns, we implement three different missing value fixing algorithms, which are further incorporated into deep neural network that consists of LSTM (Long Short-term Memory) layers and fully connected layers. Real-world air quality and meteorological datasets (Jingjinji area, China) are used for model training and testing. Deep feed forward neural networks (DFNN) and gradient boosting decision trees (GBDT) are trained as baseline models against the proposed DRNN. Performances of three missing value fixing algorithms, as well as different machine learning models are evaluated and analysed. Experiments show that the proposed DRNN framework outperforms both DFNN and GBDT, therefore validating the capacity of the proposed framework. Our results also provide useful insights for better understanding of different strategies that handle missing values.
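The abstract mentions representing time series patterns with a "missing tag" and a "missing interval". One possible reading is sketched below in plain Python. The definitions used here are assumptions, not taken from the paper: the tag is 1 where a step is missing, the interval counts consecutive missing steps, and forward fill stands in for one of the three fixing algorithms, which the abstract does not name. The function name and sample values are hypothetical.

```python
def tag_interval_ffill(series):
    """Return (filled, tags, intervals) for a list where None marks a gap.

    tags[t]      is 1 when step t was missing, else 0.
    intervals[t] counts consecutive missing steps up to and including t
                 (0 when step t is observed).
    filled[t]    carries the last observed value forward; gaps at the
                 start of the series stay None.
    """
    filled, tags, intervals = [], [], []
    last_value = None
    gap = 0
    for value in series:
        if value is None:
            gap += 1
            tags.append(1)
            intervals.append(gap)
            filled.append(last_value)
        else:
            gap = 0
            last_value = value
            tags.append(0)
            intervals.append(0)
            filled.append(value)
    return filled, tags, intervals


# Hypothetical hourly readings with three missing steps.
readings = [35.0, None, None, 42.0, None]
filled, tags, intervals = tag_interval_ffill(readings)
```

Under this reading, the three lists would be stacked as input features so that a downstream model sees both the filled value and how stale it is.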

