Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation

Emam, Khaled El; Mosquera, Lucy; Bass, Jason

doi:10.2196/23139

Cited by 50 publications

(57 citation statements)

References 48 publications


“…71 Furthermore, existing evaluations have concluded that the privacy risks using sequential tree synthesis is low. 47 , 72 With these types of models, a variable is synthesized by using the values earlier in the sequence as predictors. Conceptually, sequential synthesis is similar to modeling multiple outcome variables using classifier chains 73 and regressor chains.…”

Section: Methods (mentioning)

confidence: 99%

“…The first is attribute disclosure conditional on identity disclosure, which assesses the probability of mapping a synthetic record to a real person, and conditional on that learning something new about the individual. 47 The second is membership disclosure which assesses whether an adversary would reliably know whether a target individual was in the real dataset used for synthesis. The details of the methods used for each of these two evaluations are provided in the appendix.…”

Section: Methods (mentioning)

confidence: 99%


Evaluating the utility of synthetic COVID-19 case data

Emam, Mosquera, Jonker, et al. 2021

JAMIA Open

Self Cite


Background: Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy protective manner. Objectives: Evaluate the utility of synthetic data by comparing analysis results between real and synthetic data. Methods: A gradient boosted classification tree was built to predict death using Ontario’s 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was developed on a synthesized dataset and compared to one from the original data. Results: The AUROC and AUPRC for the real data model were 0.945 [95% confidence interval (CI), 0.941–0.948] and 0.34 (95% CI, 0.313–0.368), respectively. The synthetic data model had AUROC and AUPRC of 0.94 (95% CI, 0.936–0.944) and 0.313 (95% CI, 0.286–0.342) with confidence interval overlap of 45.05% and 52.02% when compared with the real data. The most important predictors of death for the real and synthetic models were in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two data sets. Attribute disclosure risks were 0.0585, and membership disclosure risk was low. Conclusions: This synthetic dataset could be used as a proxy for the real dataset.
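The abstract reports a "confidence interval overlap" without giving its formula. A minimal sketch follows, assuming the commonly used definition in which the length of the intersection is expressed as a fraction of each interval and the two fractions are averaged. The function name `ci_overlap` is hypothetical, and the abstract does not confirm that this is the exact definition the authors used. Because the published bounds are rounded, the sketch should not be expected to reproduce the reported percentages exactly.

```python
def ci_overlap(real_ci, synth_ci):
    """Average fraction of each interval covered by their intersection.

    Assumed definition (not confirmed by the abstract): the intersection
    length is divided by each interval's length, and the two ratios are
    averaged. Disjoint intervals yield a negative value.
    """
    (l_r, u_r), (l_s, u_s) = real_ci, synth_ci
    if u_r <= l_r or u_s <= l_s:
        raise ValueError("each interval needs lower < upper")
    intersection = min(u_r, u_s) - max(l_r, l_s)
    return 0.5 * (intersection / (u_r - l_r) + intersection / (u_s - l_s))


if __name__ == "__main__":
    # Rounded AUROC bounds as printed in the abstract (real, synthetic).
    print(f"AUROC overlap: {ci_overlap((0.941, 0.948), (0.936, 0.944)):.2%}")
    # Rounded AUPRC bounds as printed in the abstract (real, synthetic).
    print(f"AUPRC overlap: {ci_overlap((0.313, 0.368), (0.286, 0.342)):.2%}")
```

A value of 1 means the two intervals coincide, and values near 0 mean they barely touch.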


“…An adversary may have access to partial information (quasi-identifiers such as age and gender) about individuals in the population and may attempt to determine whether additional information about an individual can be gained from the synthetic dataset (population-to-sample attack), or whether an individual in the synthetic dataset can be matched to an individual in the population (sample-to-population attack). Under the assumption that an adversary will only attempt one of these attacks, but without knowing which one, the overall probability of one of these attacks being successful is given by the maximum probability of either attack being successful (El Emam et al, 2020).…”

Section: Methods: Identity Disclosure Risk (mentioning)

confidence: 99%
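The combination rule described in the statement above can be sketched in a few lines. This is an illustration only: the function name `overall_identity_risk` and the example probabilities are hypothetical, and the sketch takes the two attack success probabilities as given rather than estimating them from data.

```python
def overall_identity_risk(p_population_to_sample, p_sample_to_population):
    """Overall risk when the adversary attempts only one of the two attacks.

    Since it is not known which attack will be attempted, the overall
    probability of success is taken as the larger of the two.
    """
    for p in (p_population_to_sample, p_sample_to_population):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return max(p_population_to_sample, p_sample_to_population)


if __name__ == "__main__":
    # Hypothetical inputs for illustration, not values from any cited study.
    print(overall_identity_risk(0.02, 0.05))
```

Taking the maximum is a conservative choice: the reported risk is never lower than the risk of whichever attack is actually attempted.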

“…The datasets were generated and published as part of the Health Gym, a project aiming to publicly distribute synthetic longitudinal health data for developing machine learning algorithms (with a particular focus on offline reinforcement learning) and for educational purposes. The datasets are highly realistic (a publication detailing the generation and quality assurance process is currently in preparation) and here we report on the risk of identity disclosure associated with the release of these data, using current best practices (Goncalves et al, 2020;El Emam et al, 2020).…”

Section: Introduction (mentioning)

confidence: 99%

Synthetic Acute Hypotension and Sepsis Datasets Based on MIMIC-III and Published as Part of the Health Gym Project

Kuo, Polizzotto, Finfer, et al. 2021

Preprint


These two synthetic datasets comprise vital signs, laboratory test results, administered fluid boluses and vasopressors for 3,910 patients with acute hypotension and for 2,164 patients with sepsis in the Intensive Care Unit (ICU). The patient cohorts were built using previously published inclusion and exclusion criteria and the data were created using Generative Adversarial Networks (GANs) and the MIMIC-III Clinical Database. The risk of identity disclosure associated with the release of these data was estimated to be very low (0.045%). The datasets were generated and published as part of the Health Gym, a project aiming to publicly distribute synthetic longitudinal health data for developing machine learning algorithms (with a particular focus on offline reinforcement learning) and for educational purposes.


“…Besides addressing the privacy concerns, synthetic data is an effective way to increase the amount of available data without additional costs because of its additive nature [3,4]. Prior work showed exciting results when generating both structured [5] and unstructured medical data [2]. In particular, recent advances in neural language modeling show promising results in generating high-quality and realistic text [6].…”

Section: Introduction (mentioning)

confidence: 99%

Generating Synthetic Training Data for Supervised De-Identification of Electronic Health Records

Libbi, Trienes, Trieschnigg, et al. 2021

Future Internet


A major hurdle in the development of natural language processing (NLP) methods for Electronic Health Records (EHRs) is the lack of large, annotated datasets. Privacy concerns prevent the distribution of EHRs, and the annotation of data is known to be costly and cumbersome. Synthetic data presents a promising solution to the privacy concern, if synthetic data has comparable utility to real data and if it preserves the privacy of patients. However, the generation of synthetic text alone is not useful for NLP because of the lack of annotations. In this work, we propose the use of neural language models (LSTM and GPT-2) for generating artificial EHR text jointly with annotations for named-entity recognition. Our experiments show that artificial documents can be used to train a supervised named-entity recognition model for de-identification, which outperforms a state-of-the-art rule-based baseline. Moreover, we show that combining real data with synthetic data improves the recall of the method, without manual annotation effort. We conduct a user study to gain insights on the privacy of artificial text. We highlight privacy risks associated with language models to inform future research on privacy-preserving automated text generation and metrics for evaluating privacy-preservation during text generation.


Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation

Abstract: Generalizations on the quasi-identifiers can be expressed in terms of the example hierarchies as illustrated for the three quasi-identifiers in Figure 1.
