Composite Bloom Filters for Secure Record Linkage

Durham, Elizabeth; Kantarcıoğlu, Murat; Xue, Yuan; Tóth, Csaba D.; Kuzu, Mehmet; Malin, Bradley

doi:10.1109/tkde.2013.91

Cited by 76 publications

(84 citation statements)

References 36 publications


“…Therefore, instead of deleting characters, sampling can be used. Durham (2012) published a variation of CLKs, denoted by Durham et al (2013) as composite Bloom filters. Bit positions from separate Bloom filters for each identifier are sampled.…”

Section: Sampling Bits For Composite Bloom Filters (mentioning)

confidence: 99%

Privacy‐preserving record linkage

Schnell

2015

Methodological Developments in Data Linkage
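
The bit-sampling idea in the statement above can be sketched as follows. This is a minimal illustration, not the published method: the bigram tokenisation, the keyed SHA-256 hashing, the filter size, the uniform sampling, and every name (`field_bloom_filter`, `composite_bloom_filter`, `samples_per_field`, `seed`) are assumptions made for the example. The paper itself is described as using a statistically informed sampling scheme with a tunable security parameter, which this sketch does not reproduce.

```python
import hashlib
import random


def bigrams(value):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def field_bloom_filter(value, size=64, num_hashes=4, key=b"shared-secret"):
    """Encode one identifier as its own Bloom filter (hypothetical parameters)."""
    bits = [0] * size
    for gram in bigrams(value):
        for i in range(num_hashes):
            digest = hashlib.sha256(key + bytes([i]) + gram.encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % size] = 1
    return bits


def composite_bloom_filter(record, samples_per_field, seed=2013, size=64):
    """Sample bits from each field-level filter and concatenate them.

    The seeded generator stands in for a secret shared by both data holders,
    so every record is sampled and permuted in the same way (an assumption).
    """
    rng = random.Random(seed)
    composite = []
    for field in sorted(samples_per_field):
        positions = rng.sample(range(size), samples_per_field[field])
        bits = field_bloom_filter(record[field], size=size)
        composite.extend(bits[p] for p in positions)
    rng.shuffle(composite)  # same permutation for every record with this seed
    return composite


record = {"forename": "Elizabeth", "surname": "Durham", "city": "Nashville"}
samples = {"forename": 16, "surname": 24, "city": 8}
encoded = composite_bloom_filter(record, samples)
```

Drawing more bits from one field than another is a crude way of weighting identifiers; how those counts should be chosen is exactly the tuning question the citing authors raise.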


“…Durham et al [75] proposed a protocol for probabilistic record linkage based on a Bloom filter with the objective of avoiding possible frequency-based cryptanalysis by encoding each identifier of a record with a separate Bloom filter. They introduced a method for encoding the set of identifiers of a record as a Bloom filter.…”

Section: Privacy-preserving Record Linkage (mentioning)

confidence: 99%

“…They introduced a method for encoding the set of identifiers of a record as a Bloom filter. In contrast to the best practice protocol, the PPRL protocols [73,75] do not allow manual review, as the identifiers are encrypted.…”

Section: Privacy-preserving Record Linkage (mentioning)

confidence: 99%

A Survey of Privacy-Preserving Techniques for Reuse of Distributed Health Data (Preprint)

Yigzaw

Berntsen

Hartvigsen

et al. 2018

Preprint


Background: Large amounts of detailed electronic health data are being collected. Reuse of these data has enormous potential for scientific discoveries that enables the improvement of healthcare systems' effectiveness, efficiency, and quality of care. However, health data reuse should protect the privacy interests of the stakeholders (i.e., patients and healthcare providers) and promote public good through research. This is particularly challenging when the data are distributed across several data custodians.


“…Unfortunately, the use of basic bloom filters is still being proposed in the medical informatics community as a privacy preserving method [34,38]. In order to address frequency attacks on basic bloom filters, Durham et al propose combining multiple Bloom filters by using a statistically informed method of sampling [11]. The method makes frequency attacks difficult, but requires the tuning of a security parameter that can affect linkage results.…”

Section: Related Work (mentioning)

confidence: 99%

“…British Columbia voter's list; Adly used datasets with 4,000, 10,000 and 20,000 records, generated by sampling from the list; manually controlled and identified the percentage of similar records between each set pair
Schnell et al [39]: Two German private administration databases, each with about 15,000 records
Durham et al [10]: Created 100 datasets with 1,000 records in each from the identifiers and demographics within the patient records in the electronic medical record system of the Vanderbilt University Medical Center; data sets to link to are generated from these 100 sets using a "data corrupter"
DuVall et al [14]: Used the enterprise data warehouse of the University of Utah Health Sciences Center; 118,404 known duplicate record pairs, identified using the Utah Population Database
Karakasidis et al [25]: Used the FEBRL synthetic data generator [6] for performance and accuracy experiments
Kuzu et al [26]: A sample of 20,000 records from the North Carolina voter's registration list; to evaluate the effect of typographical and semantic name errors, the sample was synthetically corrupted
Durham et al [11]: Ten independent samples of 100,000 records from the North Carolina voter's registration list; each sample was independently corrupted to generate samples at the second party
Dusetzina et al [12]: Individuals in the North Carolina Central Cancer Registry (NCCCR) diagnosed with colon cancer linked to enrollment and claims data for beneficiaries in privately insured health plans in North Carolina; 104,360 record pairs
Gruenheid et al [21]: Cora dataset; Biz dataset consisting of multiple versions of a business records dataset, each with 4,892 records
Randall et al [34]: approximately 3.5 × 10^9 record pairs from ten years of the West Australian Hospital Admissions data; approximately 16 × 10^9 record pairs from ten years of the New South Wales admitted patient data
Schmidlin et al [38]: No experimental evaluation; timing estimated for a linkage attempt with 100,000 records in one data set and 50,000 records in another…”

Section: Linking Health Records For Federated Query Processing (mentioning)

confidence: 99%

Linking Health Records for Federated Query Processing

Dewri

Ong

Thurimella

2016

Proceedings on Privacy Enhancing Technologies


A federated query portal in an electronic health record infrastructure enables large epidemiology studies by combining data from geographically dispersed medical institutions. However, an individual's health record has been found to be distributed across multiple carrier databases in local settings. Privacy regulations may prohibit a data source from revealing clear text identifiers, thereby making it non-trivial for a query aggregator to determine which records correspond to the same underlying individual. In this paper, we explore this problem of privately detecting and tracking the health records of an individual in a distributed infrastructure. We begin with a secure set intersection protocol based on commutative encryption, and show how to make it practical on comparison spaces as large as 10^10 pairs. Using bigram matching, precomputed tables, and data parallelism, we successfully reduced the execution time to a matter of minutes, while retaining a high degree of accuracy even in records with data entry errors. We also propose techniques to prevent the inference of identifier information when knowledge of underlying data distributions is known to an adversary. Finally, we discuss how records can be tracked utilizing the detection results during query processing.
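
The set intersection step mentioned in this abstract relies on a cipher whose encryptions commute. A minimal sketch of that idea follows, assuming a modular exponentiation cipher over a prime field; the abstract does not say which cipher the paper uses, and the modulus, the hashing into the group, and all function names here are hypothetical. The 127-bit modulus is far too small for real use and both parties are simulated in one process.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy-sized, chosen only for illustration


def keygen():
    """Pick a secret exponent that is invertible modulo P - 1."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def to_group(identifier):
    """Hash an identifier to an integer in the range 1..P-1."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def encrypt(value, key):
    """Exponentiation cipher: encrypting under two keys commutes."""
    return pow(value, key, P)


def private_intersection_size(ids_a, ids_b):
    """Count shared identifiers by comparing doubly encrypted values only."""
    key_a, key_b = keygen(), keygen()
    a_once = [encrypt(to_group(x), key_a) for x in ids_a]  # computed by party A
    b_once = [encrypt(to_group(x), key_b) for x in ids_b]  # computed by party B
    a_twice = {encrypt(c, key_b) for c in a_once}          # B re-encrypts A's values
    b_twice = {encrypt(c, key_a) for c in b_once}          # A re-encrypts B's values
    return len(a_twice & b_twice)


shared = private_intersection_size({"alice", "bob", "carol"}, {"bob", "carol", "dave"})
```

Because (x^a)^b and (x^b)^a are equal modulo P, matching identifiers collide after both encryptions even though neither party sees the other's clear text values.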
