Missing values in deduplication of electronic patient data

Sariyar, Murat; Borg, Andreas; Pommerening, Klaus

doi:10.1136/amiajnl-2011-000461

Cited by 22 publications

(11 citation statements)

References 18 publications

“…In the literature, researchers often treat missing data as disagreements, i.e., $\gamma_k(i,j) = 0$ if $\delta_k(i,j) = 1$ (e.g., Goldstein and Harron 2015; Ong et al 2014; Sariyar, Borg, and Pommerening 2012). This procedure is problematic because a true match can contain missing values.…”

Section: The Proposed Methodology

mentioning

confidence: 99%

Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records

Enamorado, Fifield, Imai

2019

Am Polit Sci Rev

118

Since most social science research relies on multiple data sources, merging data sets is an essential part of researchers’ workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data may contain missing and inaccurate information. These problems are severe especially when merging large-scale administrative records. We develop a fast and scalable algorithm to implement a canonical model of probabilistic record linkage that has many advantages over deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. An open-source software package is available for implementing the proposed methodology.
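The coding convention described in the citation statement above can be sketched in a few lines. This is a minimal illustration, not the implementation from either paper: the field names, the two records, and the function name `compare` are hypothetical, and exact string equality stands in for whatever similarity measure a real linkage would use. For each field k, `delta` marks whether the comparison involves a missing value, and `gamma` holds the agreement indicator under one of two conventions: missing coded as disagreement, or missing kept as a separate level.

```python
from typing import Optional

Record = dict[str, Optional[str]]

def compare(a: Record, b: Record, fields: list[str],
            missing_as_disagreement: bool = True):
    """Return (gamma, delta) for one record pair.

    delta[k] = 1 if field k is missing in either record, else 0.
    gamma[k] = 1 on exact agreement and 0 on disagreement; when the field
    is missing, gamma[k] is 0 (disagreement convention) or None (separate level).
    """
    gamma = []
    delta = []
    for k in fields:
        missing = a.get(k) is None or b.get(k) is None
        delta.append(1 if missing else 0)
        if missing:
            gamma.append(0 if missing_as_disagreement else None)
        else:
            gamma.append(1 if a[k] == b[k] else 0)
    return gamma, delta

# Hypothetical records: last name missing in rec_b, birth years conflict.
FIELDS = ["first_name", "last_name", "birth_year"]
rec_a = {"first_name": "Anna", "last_name": "Meyer", "birth_year": "1970"}
rec_b = {"first_name": "Anna", "last_name": None, "birth_year": "1971"}

gamma_dis, delta = compare(rec_a, rec_b, FIELDS, missing_as_disagreement=True)
gamma_sep, _ = compare(rec_a, rec_b, FIELDS, missing_as_disagreement=False)
print(gamma_dis, delta)  # [1, 0, 0] [0, 1, 0]
print(gamma_sep)         # [1, None, 0]
```

Under the disagreement convention the missing last name is indistinguishable from the conflicting birth year, which is the concern the statement raises: a true match with a missing field is penalized as if the field conflicted.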

“…For example, although Goldstein and Harron (2015) suggest the possibility of treating a comparison that involves a missing value as a separate agreement value, Sariyar, Borg, and Pommerening (2012) find that this approach does not outperform the standard method of treating missing values as disagreements.…”

mentioning

confidence: 99%

Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records

Enamorado, Fifield, Imai

2019

Am Polit Sci Rev

118

“…Past work has primarily focused on identifying relevant features in existing data to infer missing data using classification tree models Prather et al [1997], Sariyar et al [2011]. This representation of a hierarchical structure is achieved by inducing a classification tree on labeled training data, i.e., typically a manually selected subset of the existing data.…”

Section: Data Cleansing

mentioning

confidence: 99%

Data Infrastructure for Medical Research

Heinis, Ailamaki

2017

FNT in Databases

While we are witnessing rapid growth in data across the sciences and in many applications, this growth is particularly remarkable in the medical domain, be it because of higher resolution instruments and diagnostic tools (e.g. MRI), new sources of structured data like activity trackers, the widespread use of electronic health records and many others. The sheer volume of the data is not, however, the only challenge to be faced when using medical data for research. Other crucial challenges include data heterogeneity, data quality, data privacy and so on. In this article, we review solutions addressing these challenges by discussing the current state of the art in the areas of data integration, data cleaning, data privacy, scalable data access and processing in the context of medical data. The techniques and tools we present will give practitioners (computer scientists and medical researchers alike) a starting point to understand the challenges and solutions and ultimately to analyse medical data and gain better and quicker insights.

“…There are two alternative approaches to deal with missing values: Impute and Ignore. Imputation treatments [2] [3] fill in attributes in the instance vector using statistical techniques and the complete vector is fed to the predictor. Ignore treatments (also called reduced model or ensemble classifier treatments) [4] [5] overlook the missing attributes, produce a vector based on the available attributes, and feed that vector to a predictor trained on those particular attributes.…”

Section: Introduction

mentioning

confidence: 99%

Impute vs. Ignore: Missing values for prediction

Zhang, Rahman, D’Este

2013

The 2013 International Joint Conference on Neural Networks (IJCNN)

Sensor faults or communication errors can cause certain sensor readings to become unavailable for prediction purposes. In this paper we evaluate the performance of imputation techniques and techniques that ignore the missing values, in scenarios: (i) when values are missing only during prediction phase, and (ii) when values are missing during both the induction and prediction phase. We also investigated the influence of different scales of missingness on the performance of these treatments. The results can be used as a guideline to facilitate the choice of different missing value treatments under different circumstances.
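The two treatments named in the citation statement above (impute and ignore) can be sketched as follows, for the scenario in which a value is missing only at prediction time. This is an illustration under stated assumptions, not the method evaluated in the paper: the training data are made up, the predictor is a hypothetical nearest-centroid classifier, and imputation uses the training mean of the missing attribute. For the ignore treatment, a reduced model is fitted on only the attributes that are present in the query.

```python
import math

# Made-up training data: (attribute vector, label).
TRAIN = [
    ([1.0, 1.0], "low"),
    ([2.0, 1.0], "low"),
    ([8.0, 9.0], "high"),
    ([9.0, 9.0], "high"),
]

def centroids(train, attrs):
    """Fit a nearest-centroid model using only the attribute indices in attrs."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(attrs))
        for j, a in enumerate(attrs):
            acc[j] += x[a]
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(model, x):
    """Return the label whose centroid is closest to x (Euclidean distance)."""
    return min(model, key=lambda y: math.dist(model[y], x))

def predict_impute(train, x):
    """Impute: fill each missing attribute with its training mean, use the full model."""
    n_attrs = len(x)
    means = [sum(row[a] for row, _ in train) / len(train) for a in range(n_attrs)]
    filled = [means[a] if v is None else v for a, v in enumerate(x)]
    return predict(centroids(train, list(range(n_attrs))), filled)

def predict_ignore(train, x):
    """Ignore: fit a reduced model on the attributes that are present in x."""
    present = [a for a, v in enumerate(x) if v is not None]
    return predict(centroids(train, present), [x[a] for a in present])

query = [None, 8.5]  # first attribute missing at prediction time
label_impute = predict_impute(TRAIN, query)
label_ignore = predict_ignore(TRAIN, query)
print(label_impute, label_ignore)  # high high
```

On this toy query both treatments agree. They can diverge when the imputed value pulls the filled vector toward a different class than the observed attributes alone would suggest.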
