“…British Columbia voter's list; Adly used datasets with 4,000, 10,000 and 20,000 records, generated by sampling from the list; manually controlled and identified the percentage of similar records between each set pair
Schnell et al [39]: Two German private administration databases, each with about 15,000 records
Durham et al [10]: Created 100 datasets with 1,000 records in each from the identifiers and demographics within the patient records in the electronic medical record system of the Vanderbilt University Medical Center; the datasets to link to were generated from these 100 sets using a "data corrupter"
DuVall et al [14]: Used the enterprise data warehouse of the University of Utah Health Sciences Center; 118,404 known duplicate record pairs, identified using the Utah Population Database
Karakasidis et al [25]: Used the FEBRL synthetic data generator [6] for performance and accuracy experiments
Kuzu et al [26]: A sample of 20,000 records from the North Carolina voter's registration list; to evaluate the effect of typographical and semantic name errors, the sample was synthetically corrupted
Durham et al [11]: Ten independent samples of 100,000 records from the North Carolina voter's registration list; each sample was independently corrupted to generate samples at the second party
Dusetzina et al [12]: Individuals in the North Carolina Central Cancer Registry (NCCCR) diagnosed with colon cancer, linked to enrollment and claims data for beneficiaries in privately insured health plans in North Carolina; 104,360 record pairs
Gruenheid et al [21]: Cora dataset; Biz dataset consisting of multiple versions of a business records dataset, each with 4,892 records
Randall et al [34]: Approximately 3.5 × 10^9 record pairs from ten years of the West Australian Hospital Admissions data; approximately 16 × 10^9 record pairs from ten years of the New South Wales admitted patient data
Schmidlin et al [38]: No experimental evaluation; timing estimated for a linkage attempt with 100,000 records in one dataset and 50,000 records in another…”