Robin Hood and Matthew Effects -- Differential Privacy Has Disparate Impact on Synthetic Data

Ganev, Georgy; Oprisanu, Bristena; De Cristofaro, Emiliano

doi:10.48550/arxiv.2109.11429

Cited by 4 publications

(8 citation statements)

References 19 publications

“…First, we note that with increased privacy PATE-GAN, again, has much lower variability/spread in terms of size and a smaller drop in terms of recall. We also clearly observe the opposing size effects the two generative models exhibit, similarly to [14]: DP-WGAN makes the classes more uniform, i.e., large classes are reduced, and small classes are increased, while PATE-GAN further enforces the imbalance, large classes become even bigger.…”

Section: Mixed Class Results (mentioning)

confidence: 83%

“…We experiment with privacy budgets (ε) of 0.5, 5, 15, and infinity ("non-DP"). We measure the class distributions in the resulting synthetic datasets as well as class recall from classifiers (logistic regression similar to [14]) trained on the real/synthetic data and tested on put-aside test data. We also report RMSE for sizes and truncated 2 RMSE (TRMSE) for recall weighted by the real sizes in App.…”

Section: Evaluation Methodology (mentioning)

confidence: 99%
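
The two measurements named in the statement above (class sizes in the synthetic data, and per-class recall of a classifier evaluated on held-out data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited authors' code: the helper names `class_proportions`, `size_rmse`, and `class_recall` are hypothetical, the RMSE here is a plain unweighted one over class proportions, and the paper's truncated, size-weighted TRMSE is not reproduced because its exact definition is not given in the excerpt.

```python
from collections import Counter
from math import sqrt


def class_proportions(labels, classes):
    """Fraction of records that fall in each class (hypothetical helper)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts.get(c, 0) / total for c in classes}


def size_rmse(real_labels, synth_labels, classes):
    """Unweighted RMSE between real and synthetic class proportions.

    Assumption: a plain RMSE, not the weighted/truncated variant
    mentioned in the cited text.
    """
    real = class_proportions(real_labels, classes)
    synth = class_proportions(synth_labels, classes)
    squared_errors = [(real[c] - synth[c]) ** 2 for c in classes]
    return sqrt(sum(squared_errors) / len(classes))


def class_recall(y_true, y_pred, classes):
    """Per-class recall: correctly predicted members / true members."""
    recall = {}
    for c in classes:
        # Predictions for the test records whose true label is c.
        preds_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        if preds_for_c:
            recall[c] = sum(p == c for p in preds_for_c) / len(preds_for_c)
        else:
            recall[c] = float("nan")  # class absent from the test data
    return recall


# Toy labels: the synthetic data over-represents the minority class 1.
real_labels = [0] * 8 + [1] * 2
synth_labels = [0] * 6 + [1] * 4
size_error = size_rmse(real_labels, synth_labels, classes=[0, 1])

# Toy predictions from some classifier on held-out test labels.
recalls = class_recall([0, 0, 1, 1], [0, 1, 1, 1], classes=[0, 1])
```

In a setup like the one described, the predictions would come from a classifier trained once on real data and once on synthetic data at each privacy budget, so that the per-class recall gap can be compared across budgets.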

“…Finally, analyzing the DP disparity on generative models, Cheng et al [8] show that training classifiers on balanced DP synthetic images could result in increased majority subgroup influence and utility degradation. Focusing on tabular data, Pereira et al [28] look at single-attribute subgroup fairness and overall classification while Ganev et al [14] analyze class as well as single/multi-attribute subgroup classification parity over a variety of imbalances and privacy budgets. They find that the disparate effects of DP could be opposing depending on the specific generative model and DP mechanism.…”

Section: Related Work (mentioning)

confidence: 99%

“…Our work is perhaps closest in spirit to [14,36] but we focus on generative models unlike the former and use image data and have a more disciplined approach to constructing the class imbalances, unlike the latter.…”

Section: Related Work (mentioning)

confidence: 99%

“…For example, in the case of deep learning classifiers [4,12,30] empirically illustrate the disparate degradation caused by DP-SGD. However, comparisons between DP-SGD and PATE are still relatively unstudied in this light, with Uniyal et al [36] doing so for classifiers, and more recently, Ganev et al [14] demonstrating the said effects in generative models trained on imbalanced tabular data. To fill this gap, we set out to examine and compare two GAN models trained with DP guarantees (DP-WGAN and PATE-GAN) on imbalanced image data (MNIST) in several imbalance settings.…”

Section: Introduction (mentioning)

confidence: 99%


DP-SGD vs PATE: Which Has Less Disparate Impact on GANs?

Ganev

2021

Preprint

Self Cite


Generative Adversarial Networks (GANs) are among the most popular approaches to generate synthetic data, especially images, for data sharing purposes. Given the vital importance of preserving the privacy of the individual data points in the original data, GANs are trained utilizing frameworks with robust privacy guarantees such as Differential Privacy (DP). However, these approaches remain widely unstudied beyond single performance metrics when presented with imbalanced datasets. To this end, we systematically compare GANs trained with the two best-known DP frameworks for deep learning, DP-SGD and PATE, in different data imbalance settings from two perspectives: the size of the classes in the generated synthetic data and their classification performance. Our analyses show that applying PATE, similarly to DP-SGD, has a disparate effect on the under/over-represented classes but in a much milder magnitude, making it more robust. Interestingly, our experiments consistently show that for PATE, unlike DP-SGD, the privacy-utility trade-off is not monotonically decreasing but is much smoother and inverted U-shaped, meaning that adding a small degree of privacy actually helps generalization. However, we have also identified some settings (e.g., large imbalance) where PATE-GAN completely fails to learn some subparts of the training data.

Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy

Rosenblatt, Herman, Holovenko et al.

2024

SIGMOD Rec.


Differential privacy (DP) data synthesizers are increasingly proposed to afford public release of sensitive information, offering theoretical guarantees for privacy (and, in some cases, utility), but limited empirical evidence of utility in practical settings. Utility is typically measured as the error on representative proxy tasks, such as descriptive statistics, multivariate correlations, the accuracy of trained classifiers, or performance over a query workload. The ability for these results to generalize to practitioners' experience has been questioned in a number of settings, including the U.S. Census. In this paper, we propose an evaluation methodology for synthetic data that avoids assumptions about the representativeness of proxy tasks, instead measuring the likelihood that published conclusions would change had the authors used synthetic data, a condition we call epistemic parity. Our methodology consists of reproducing empirical conclusions of peer-reviewed papers on real, publicly available data, then re-running these experiments a second time on DP synthetic data and comparing the results.


Synthetic Data -- what, why and how?

Jordon, Szpruch, Houssiau et al.

2022

Preprint


This explainer document aims to provide an overview of the current state of the rapidly expanding work on synthetic data technologies, with a particular focus on privacy. The article is intended for a non-technical audience, though some formal definitions have been given to provide clarity to specialists. This article is intended to enable the reader to quickly become familiar with the notion of synthetic data, as well as understand some of the subtle intricacies that come with it. We do believe that synthetic data is a very useful tool, and our hope is that this report highlights that, while drawing attention to nuances that can easily be overlooked in its deployment. The following are the key messages that we hope to convey.

Synthetic data is a technology with significant promise. There are many applications of synthetic data: privacy, fairness, and data augmentation, to name a few. Each of these applications has the potential for a tremendous impact but also comes with risks.

Synthetic data can accelerate development. Good quality synthetic data can significantly accelerate data science projects and reduce the cost of the software development lifecycle. When combined with secure research environments and federated learning techniques, it contributes to data democratisation.

Synthetic data is not automatically private. A common misconception with synthetic data is that it is inherently private. This is not the case. Synthetic data has the capacity to leak information about the data it was derived from and is vulnerable to privacy attacks. Significant care is required to produce synthetic data that is useful and comes with privacy guarantees.

Synthetic data is not a replacement for real data. Synthetic data that comes with privacy guarantees is necessarily a distorted version of the real data. Therefore, any modelling or inference performed on synthetic data comes with additional risks. It is our belief that synthetic data should be used as a tool to accelerate the "research pipeline" but, ultimately, any final tools (that will be deployed in the real world) should be evaluated, and if necessary, fine-tuned, on the real data.

Outliers are hard to capture privately. Outliers and low probability events, as are often found in real data, are particularly difficult to capture and include in a synthetic dataset in a private way. For example, it would be very difficult to "hide" a multi-billionaire in synthetic data that contained information about wealth. A synthetic data generator would either not accurately replicate statistics regarding the very wealthy or would reveal potentially private information about these individuals.

Empirically evaluating the privacy of a single dataset can be problematic. Rigorous notions of privacy (e.g., differential privacy) are a requirement on the mechanism that generated a synthetic dataset, rather than on the dataset itself. It is not possible to rigorously evaluate the privacy of a given synthetic dataset by directly comparing it with real data. Empirical evaluations can prove useful as t...
