2020
DOI: 10.2196/23139

Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation

Abstract: Generalizations on the quasi-identifiers can be expressed in terms of the example hierarchies as illustrated for the three quasi-identifiers in Figure 1.

Citations: cited by 50 publications (57 citation statements)
References: 48 publications
“…71 Furthermore, existing evaluations have concluded that the privacy risks using sequential tree synthesis is low. 47 , 72 With these types of models, a variable is synthesized by using the values earlier in the sequence as predictors. Conceptually, sequential synthesis is similar to modeling multiple outcome variables using classifier chains 73 and regressor chains.…”
Section: Methods (mentioning)
confidence: 99%
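
The sequential idea in the statement above (each variable is synthesized using the values earlier in the sequence as predictors) can be illustrated with a minimal sketch. This is not the cited authors' method: it replaces decision trees with plain conditional frequency tables, conditions on only the `k` preceding variables, and uses made-up toy records. The names `fit_sequential`, `synthesize`, `real_rows` and `columns` are hypothetical.

```python
import random
from collections import defaultdict

def fit_sequential(rows, columns, k=1):
    """For each column, record the observed values grouped by the
    values of the k columns that precede it in the sequence.
    (A frequency table stands in for a fitted tree here.)"""
    tables = []
    for j, col in enumerate(columns):
        table = defaultdict(list)
        for row in rows:
            context = tuple(row[c] for c in columns[max(0, j - k):j])
            table[context].append(row[col])
        tables.append(table)
    return tables

def synthesize(tables, columns, n, k=1, seed=0):
    """Generate n records one variable at a time, sampling each value
    conditional on the already-synthesized earlier values."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        record = {}
        for j, col in enumerate(columns):
            context = tuple(record[c] for c in columns[max(0, j - k):j])
            record[col] = rng.choice(tables[j][context])
        records.append(record)
    return records

# Toy, invented data: not taken from any real dataset.
real_rows = [
    {"age_group": "30-39", "sex": "F", "diagnosis": "asthma"},
    {"age_group": "30-39", "sex": "M", "diagnosis": "diabetes"},
    {"age_group": "40-49", "sex": "F", "diagnosis": "diabetes"},
    {"age_group": "40-49", "sex": "F", "diagnosis": "asthma"},
]
columns = ["age_group", "sex", "diagnosis"]

tables = fit_sequential(real_rows, columns, k=1)
synthetic = synthesize(tables, columns, n=5, k=1, seed=42)
```

The first column has no predecessors, so it is sampled from its marginal distribution; every later column is sampled given its predecessor, which is the chain structure the statement compares to classifier and regressor chains.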
“…The first is attribute disclosure conditional on identity disclosure, which assesses the probability of mapping a synthetic record to a real person, and conditional on that learning something new about the individual. 47 The second is membership disclosure which assesses whether an adversary would reliably know whether a target individual was in the real dataset used for synthesis. The details of the methods used for each of these two evaluations are provided in the appendix.…”
Section: Methods (mentioning)
confidence: 99%
“…An adversary may have access to partial information (quasi-identifiers such as age and gender) about individuals in the population and may attempt to determine whether additional information about an individual can be gained from the synthetic dataset (population-to-sample attack), or whether an individual in the synthetic dataset can be matched to an individual in the population (sample-to-population attack). Under the assumption that an adversary will only attempt one of these attacks, but without knowing which one, the overall probability of one of these attacks being successful is given by the maximum probability of either attack being successful (El Emam et al, 2020).…”
Section: Methods: Identity Disclosure Risk (mentioning)
confidence: 99%
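
The combination rule described in the statement above, where the overall risk is the maximum of the two attack probabilities, can be written as a short sketch. The function name `overall_identity_risk` and the example probabilities are hypothetical; how each per-attack probability is estimated is not covered here.

```python
def overall_identity_risk(p_population_to_sample, p_sample_to_population):
    """Combine the two attack probabilities by taking their maximum,
    assuming the adversary attempts only one attack and we do not
    know which one."""
    for p in (p_population_to_sample, p_sample_to_population):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return max(p_population_to_sample, p_sample_to_population)

# Illustrative, invented values for the two directions of attack.
risk = overall_identity_risk(0.02, 0.05)
```

Taking the maximum is the conservative choice under that assumption: the reported risk is never lower than the risk of whichever attack is actually attempted.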
“…The datasets were generated and published as part of the Health Gym, a project aiming to publicly distribute synthetic longitudinal health data for developing machine learning algorithms (with a particular focus on offline reinforcement learning) and for educational purposes. The datasets are highly realistic (a publication detailing the generation and quality assurance process is currently in preparation) and here we report on the risk of identity disclosure associated with the release of these data, using current best practices (Goncalves et al, 2020;El Emam et al, 2020).…”
Section: Introduction (mentioning)
confidence: 99%
“…Besides addressing the privacy concerns, synthetic data is an effective way to increase the amount of available data without additional costs because of its additive nature [3,4]. Prior work showed exciting results when generating both structured [5] and unstructured medical data [2]. In particular, recent advances in neural language modeling show promising results in generating high-quality and realistic text [6].…”
Section: Introduction (mentioning)
confidence: 99%