2021
DOI: 10.48550/arxiv.2102.03314
Preprint

On Utility and Privacy in Synthetic Genomic Data

Abstract: Genomic data provides researchers with an invaluable source of information to advance progress in biomedical research, personalized medicine, and drug development. At the same time, however, this data is extremely sensitive, which makes data sharing, and consequently availability, problematic if not outright impossible. As a result, organizations have begun to experiment with sharing synthetic data, which should mirror the real data's salient characteristics, without exposing it. In this paper, we provide the …

Citations: cited by 6 publications (7 citation statements)
References: 42 publications
“…But it is not accurate enough to allow these tasks in many of the real-world experimental scenarios that we tested, even though the datasets we are using are relatively small and with a limited number of features. This confirms some of the more recent cautionary tales found in the synthetic data literature Oprisanu et al (2021). • PrivBayes vs. PATE-CTGAN: Comparing the two synthetic methods, PrivBayes is better than PATE-CTGAN for medium and high epsilon bounds, especially in the Adult dataset, but worse than PATE-CTGAN for low epsilon bounds.…”
Section: Real-world Data (supporting)
confidence: 79%
“…1. Measures that capture the empirical probability of successful identification attacks assuming a motivated intruder with access to certain information (Reiter and Mitra, 2009; Stadler et al, 2020; Oprisanu et al, 2021; Hayes et al, 2018; Hilprecht et al, 2019); 2. Measures that quantify properties of the released data which are a proxy for privacy, for example k-anonymity (Sweeney, 2002; Wagner and Eckhoff, 2018; Yale et al, 2019); 3.…”
Section: Privacy (mentioning)
confidence: 99%
“…The synthetic patients are generated by a simulator [19] grounded in clinical-genetic knowledge. Furthermore, training on simulated data mitigates concerns regarding privacy breaches, in which specific individuals can be identified from the training data [55,56]. Hence, it is possible to publicly release the fully trained SHEPHERD without privacy concerns.…”
Section: Discussion (mentioning)
confidence: 99%
“…[3] introduces the framework of differential testing, similar to the notion of DP, which deems a dataset anonymized when the inference accuracy from the synthetic data is about the same whether the user's record is included in the original dataset or not. In the medical domain, [43] and [18,33] evaluate privacy vs. utility of generative models and synthetic datasets, respectively. The first two works rely on the notion of membership inference to quantify the privacy risks, the last one on equivalence classes and intra-record distances.…”
Section: Related Work (mentioning)
confidence: 99%