Privacy-preserving record linkage on large real world datasets

Randall, Sean; Ferrante, Anna; Boyd, James H.; Bauer, Jacqueline K.; Semmens, James B.

doi:10.1016/j.jbi.2013.12.003

Cited by 105 publications

(121 citation statements)

References 19 publications

Supporting

Mentioning

100

Contrasting

Unclassified

“…Bloom filter-based record linkage has been used in real-world medical applications, such as in Brazil (Napoleão Rocha 2013), Germany (Schnell 2014) and Switzerland (Kuehni et al 2011). The largest application so far has been an Australian study (Randall et al 2014). Here, healthcare data with more than 26 million records have been used.…”

mentioning

confidence: 99%

Privacy‐preserving record linkage

Schnell

2015

Methodological Developments in Data Linkage

“…Recently, Niedermeyer et al proposed an easier attack and demonstrated that Bloom filter encodings can be broken without the need for high computational resources [31]. Unfortunately, the use of basic Bloom filters is still being proposed in the medical informatics community as a privacy preserving method [34,38]. In order to address frequency attacks on basic Bloom filters, Durham et al propose combining multiple Bloom filters by using a statistically informed method of sampling [11].…”

Section: Related Work (mentioning)

confidence: 99%
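The basic Bloom filter encoding that the statement above refers to can be sketched roughly as follows. This is a minimal illustration of the general technique (character bigrams hashed into a bit array and compared with the Dice coefficient), not the implementation used in any of the cited studies. The function names, the filter size of 1000 bits, the 20 hash functions, and the `secret` key are all hypothetical choices made for this example.

```python
import hashlib


def bigrams(name):
    # Pad with underscores so the first and last letters form their own bigrams.
    padded = f"_{name.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(name, size=1000, num_hashes=20, secret=b"shared-key"):
    # Return the set of bit positions that are switched on in the filter.
    # Each bigram is hashed num_hashes times with a keyed SHA-256 digest
    # (an assumption for this sketch; other keyed hash schemes are possible).
    bits = set()
    for gram in bigrams(name):
        for k in range(num_hashes):
            digest = hashlib.sha256(secret + bytes([k]) + gram.encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a, b):
    # Dice coefficient of two sets of bit positions.
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


smith = bloom_encode("smith")
smyth = bloom_encode("smyth")
jones = bloom_encode("jones")
print(dice(smith, smyth), dice(smith, jones))
```

Names that share most of their bigrams, such as "smith" and "smyth", produce filters with a high Dice score, which is what lets the encoding tolerate typographical errors. The same property is what the frequency attacks mentioned in the statement exploit: identical bigrams always set the same bits.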

“…British Columbia voter's list; Adly used datasets with 4,000, 10,000 and 20,000 records, generated by sampling from the list; manually controlled and identified the percentage of similar records between each set pair

Schnell et al [39]: Two German private administration databases, each with about 15,000 records

Durham et al [10]: Created 100 datasets with 1,000 records in each from the identifiers and demographics within the patient records in the electronic medical record system of the Vanderbilt University Medical Center; data sets to link to are generated from these 100 sets using a "data corrupter"

DuVall et al [14]: Used the enterprise data warehouse of the University of Utah Health Sciences Center; 118,404 known duplicate record pairs, identified using the Utah Population Database

Karakasidis et al [25]: Used the FEBRL synthetic data generator [6] for performance and accuracy experiments

Kuzu et al [26]: A sample of 20,000 records from the North Carolina voter's registration list; to evaluate the effect of typographical and semantic name errors, the sample was synthetically corrupted

Durham et al [11]: Ten independent samples of 100,000 records from the North Carolina voter's registration list; each sample was independently corrupted to generate samples at the second party

Dusetzina et al [12]: Individuals in the North Carolina Central Cancer Registry (NCCCR) diagnosed with colon cancer linked to enrollment and claims data for beneficiaries in privately insured health plans in North Carolina; 104,360 record pairs

Gruenheid et al [21]: Cora dataset; Biz dataset consisting of multiple versions of a business records dataset, each with 4,892 records

Randall et al [34]: Approximately 3.5 × 10^9 record pairs from ten years of the West Australian Hospital Admissions data; approximately 16 × 10^9 record pairs from ten years of the New South Wales admitted patient data

Schmidlin et al [38]: No experimental evaluation; timing estimated for a linkage attempt with 100,000 records in one data set and 50,000 records in another…”

Section: Linking Health Records For Federated Query Processing (mentioning)

confidence: 99%

Linking Health Records for Federated Query Processing

Dewri

Ong

Thurimella

2016

Proceedings on Privacy Enhancing Technologies

A federated query portal in an electronic health record infrastructure enables large epidemiology studies by combining data from geographically dispersed medical institutions. However, an individual's health record has been found to be distributed across multiple carrier databases in local settings. Privacy regulations may prohibit a data source from revealing clear text identifiers, thereby making it non-trivial for a query aggregator to determine which records correspond to the same underlying individual. In this paper, we explore this problem of privately detecting and tracking the health records of an individual in a distributed infrastructure. We begin with a secure set intersection protocol based on commutative encryption, and show how to make it practical on comparison spaces as large as 10^10 pairs. Using bigram matching, precomputed tables, and data parallelism, we successfully reduced the execution time to a matter of minutes, while retaining a high degree of accuracy even in records with data entry errors. We also propose techniques to prevent the inference of identifier information when knowledge of underlying data distributions is known to an adversary. Finally, we discuss how records can be tracked utilizing the detection results during query processing.
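The idea of a set intersection protocol built on commutative encryption, which the abstract above starts from, can be illustrated with a toy sketch. This is not the protocol from the paper: it only handles exact matches (no bigram matching, precomputed tables, or parallelism), it simulates both parties in one process, and it assumes modular exponentiation as the commutative cipher. The modulus `P`, the helper names, and the sample records are hypothetical, and a 127-bit modulus is far too small for real use.

```python
import hashlib
import math
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1 (illustrative only, not secure).
P = 2**127 - 1


def hash_to_group(value):
    # Map an identifier string to an integer in the range 1..P-1.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def keygen():
    # Pick an exponent coprime to P - 1 so that encryption is a bijection.
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def encrypt(element, key):
    # Exponentiation commutes: (x**a)**b == (x**b)**a (mod P).
    return pow(element, key, P)


def private_intersection(records_a, records_b):
    key_a, key_b = keygen(), keygen()
    # Each party encrypts its own hashed identifiers and sends them over.
    sent_by_a = {r: encrypt(hash_to_group(r), key_a) for r in records_a}
    sent_by_b = [encrypt(hash_to_group(r), key_b) for r in records_b]
    # Each party applies its own key on top of the other's ciphertexts.
    double_a = {encrypt(c, key_b): r for r, c in sent_by_a.items()}
    double_b = {encrypt(c, key_a) for c in sent_by_b}
    # Equal identifiers yield equal double-encrypted values.
    return {r for c, r in double_a.items() if c in double_b}


print(private_intersection(["ann lee 1980", "bob ray 1975"],
                           ["bob ray 1975", "cy dow 1990"]))
```

Because the two encryption layers can be applied in either order, neither party needs to see the other's clear text identifiers to learn which records they have in common.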

“…Unfortunately, records to be linked across different datasets often lack unique identifiers for performing such an identifying and aggregating process [1]. To overcome this problem, many techniques have been developed for record linkage over the past decade [5] in various applications.…”

Section: Introduction (mentioning)

confidence: 99%