2021
DOI: 10.48550/arxiv.2112.09238
Preprint

Benchmarking Differentially Private Synthetic Data Generation Algorithms

Abstract: This work presents a systematic benchmark of differentially private synthetic data generation algorithms for tabular data. Utility of the synthetic data is evaluated by measuring how well it preserves the distributions of individual attributes and of pairs of attributes, pairwise correlations, and the accuracy of an ML classification model trained on it. In a comprehensive empirical evaluation we identify the top-performing algorithms and those that consistently fail to beat baseline approaches.
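
As a rough illustration of the pairwise-correlation criterion from the abstract, the sketch below (random placeholder data and a hypothetical `correlation_gap` helper, not the paper's benchmark code) scores how closely a synthetic table reproduces the real table's Pearson correlation matrix:

```python
import numpy as np
import pandas as pd

def correlation_gap(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Mean absolute difference between the pairwise Pearson correlation
    matrices of the real and synthetic tables (lower is better)."""
    r = real.corr().to_numpy()
    s = synth.corr().to_numpy()
    iu = np.triu_indices_from(r, k=1)  # off-diagonal entries only
    return float(np.mean(np.abs(r[iu] - s[iu])))

# Toy usage with random placeholder data, not the paper's datasets.
rng = np.random.default_rng(0)
real = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))
synth = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))
print(correlation_gap(real, synth))
```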

Cited by 12 publications (18 citation statements)
References 10 publications (14 reference statements)
“…Another well-known differential privacy mechanism is randomized response, whose standard deviation is O(√N/ε) [32], which is worse than the standard deviation of the Laplace and Gaussian mechanisms, O(1/ε). There are also differential privacy mechanisms based on data synthesis [33]. However, as anomaly detection algorithms look for "spiking" behavior in a particular time interval, these data-synthesis approaches, which try to replicate the distribution of the data as a whole, will not retain the spikes as well as the perturbation mechanisms.…”
Section: Discussion
confidence: 99%
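
To make the scaling contrast in this statement concrete, here is a small simulation sketch (ε = 1 and N = 10,000 are arbitrary assumed parameters): a Laplace-noised count of a binary attribute has error on the order of 1/ε, while a debiased randomized-response count has error on the order of √N/ε.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, N, trials = 1.0, 10_000, 500
bits = rng.integers(0, 2, size=N)  # hypothetical sensitive binary attribute
true_count = bits.sum()

# Laplace mechanism: a count query has sensitivity 1, so the noise scale is 1/eps.
lap_est = true_count + rng.laplace(scale=1.0 / eps, size=trials)

# Randomized response: each user reports truthfully with p = e^eps/(e^eps + 1)
# and flips otherwise; the aggregate count is then debiased.
p = np.exp(eps) / (np.exp(eps) + 1.0)
flips = rng.random((trials, N)) > p
noisy_counts = np.where(flips, 1 - bits, bits).sum(axis=1)
rr_est = (noisy_counts - N * (1 - p)) / (2 * p - 1)

print(f"Laplace mechanism std:   {lap_est.std():7.1f}  (~ 1/eps)")
print(f"Randomized response std: {rr_est.std():7.1f}  (~ sqrt(N)/eps)")
```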
“…After testing, the analyst can proceed with the real data (without "seeing" it). Synthetic data generation could rely on simple techniques (e.g., sampling from a normal distribution with the same mean and standard deviation as the target attribute), on ML [18,96,112], or on combining DP with either.…”
Section: Key System Desiderata
confidence: 99%
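
A minimal sketch of the "simple technique" this statement mentions, assuming purely numeric attributes (the function name is hypothetical): each column is sampled independently from a normal distribution fitted to the real column's mean and standard deviation, deliberately ignoring all correlations.

```python
import numpy as np
import pandas as pd

def naive_normal_synth(real: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Per-column Gaussian sampler: matches each attribute's empirical mean
    and standard deviation, but ignores all cross-column structure."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {c: rng.normal(real[c].mean(), real[c].std(), size=n) for c in real.columns}
    )
```

Such a baseline is useful precisely because it is trivial: a DP generator that cannot beat it on downstream tasks adds little value.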
“…Similarly to tools offering DP ML [39,52,74,90], we suggest developers package and include DP SDG logic. Guidance: [17,18,96,98,106,109,110,112]. Gap 4: (V) Visualization.…”
Section: Gaps In Differential Privacy Practice
confidence: 99%
“…Patki et al. pushed this further by distributing synthetic and real datasets randomly to teams of data scientists and evaluating whether teams working on real and synthetic datasets would arrive at approximately the same conclusions [106]. A similar approach was used by Tao et al. [107], where an XGBoost classifier is trained on synthetic data and evaluated on real data across a range of tabular datasets.…”
Section: Utility-driven Evaluation
confidence: 99%
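
The train-on-synthetic, test-on-real protocol described here can be sketched as follows (random placeholder arrays stand in for the synthetic and real tables; this assumes the `xgboost` package and is not the cited authors' code):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
# Placeholder features/labels standing in for synthetic and real tabular data.
X_synth, y_synth = rng.normal(size=(1000, 8)), rng.integers(0, 2, size=1000)
X_real, y_real = rng.normal(size=(500, 8)), rng.integers(0, 2, size=500)

# Train on synthetic, evaluate on real: if the generator preserved the
# predictive signal, this score approaches that of a model trained on real data.
clf = XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(X_synth, y_synth)
print("Train-synthetic/test-real accuracy:",
      accuracy_score(y_real, clf.predict(X_real)))
```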
“…A simple example is to focus on 1- and 2-way marginals of the data, which can be computed efficiently. The difference between these marginals can be estimated with a wide range of metrics: total variation distance [107], correlations and Cramér's V [107], or classical distances [108]. These metrics aim to capture whether the synthetic data preserves basic properties of the real data, such as histograms of individual attributes and relations between pairs of attributes.…”
Section: Fidelity-driven Evaluation
confidence: 99%
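
A sketch of the two marginal-based metrics named above, assuming categorical attributes (helper names are hypothetical): total variation distance compares 1-way marginals directly, while Cramér's V summarizes a 2-way contingency table, so fidelity can be judged by comparing V on a real attribute pair against V on the same pair in the synthetic data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def tvd(real: pd.Series, synth: pd.Series) -> float:
    """Total variation distance between two 1-way (single-attribute) marginals."""
    cats = sorted(set(real) | set(synth))
    p = real.value_counts(normalize=True).reindex(cats, fill_value=0.0)
    q = synth.value_counts(normalize=True).reindex(cats, fill_value=0.0)
    return 0.5 * float(np.abs(p - q).sum())

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association strength for a pair of categorical attributes."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
real = pd.DataFrame({"a": rng.integers(0, 3, 1000), "b": rng.integers(0, 2, 1000)})
synth = pd.DataFrame({"a": rng.integers(0, 3, 1000), "b": rng.integers(0, 2, 1000)})
print("TVD on attribute a:", tvd(real["a"], synth["a"]))
print("Cramér's V gap on (a, b):",
      abs(cramers_v(real["a"], real["b"]) - cramers_v(synth["a"], synth["b"])))
```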