2021
DOI: 10.48550/arxiv.2112.09238
Preprint

Benchmarking Differentially Private Synthetic Data Generation Algorithms

Abstract: This work presents a systematic benchmark of differentially private synthetic data generation algorithms for tabular data. Utility of the synthetic data is evaluated by measuring how well it preserves the distributions of individual attributes and of pairs of attributes, pairwise correlations, and the accuracy of an ML classification model trained on it. In a comprehensive empirical evaluation we identify the top-performing algorithms and those that consistently fail to beat baseline approaches.
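
As a rough illustration of the pairwise-correlation criterion from the abstract, the sketch below (random placeholder data and a hypothetical `correlation_gap` helper, not the paper's benchmark code) scores how closely a synthetic table reproduces the real table's Pearson correlation matrix:

```python
import numpy as np
import pandas as pd

def correlation_gap(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Mean absolute difference between the pairwise Pearson correlation
    matrices of the real and synthetic tables (lower is better)."""
    r = real.corr().to_numpy()
    s = synth.corr().to_numpy()
    iu = np.triu_indices_from(r, k=1)  # off-diagonal entries only
    return float(np.mean(np.abs(r[iu] - s[iu])))

# Toy usage with random placeholder data, not the paper's datasets.
rng = np.random.default_rng(0)
real = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))
synth = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))
print(correlation_gap(real, synth))
```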

Cited by 12 publications (18 citation statements)
References 10 publications (14 reference statements)
“…Another well-known differential privacy mechanism is randomized response, whose standard deviation is O(√N/ε) [32], which is worse than the standard deviation of the Laplace and Gaussian mechanisms, O(1/ε). There are also differential privacy mechanisms based on data synthesis [33]. However, as anomaly detection algorithms look for "spiking" behavior in a particular time interval, these data-synthesis approaches, which try to replicate the distribution of the data as a whole, will not retain the spikes as well as the perturbation mechanisms.…”
Section: Discussion
confidence: 99%
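
To make the scaling contrast in this statement concrete, here is a small simulation sketch (ε = 1 and N = 10,000 are arbitrary assumed parameters): a Laplace-noised count of a binary attribute has error on the order of 1/ε, while a debiased randomized-response count has error on the order of √N/ε.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, N, trials = 1.0, 10_000, 500
bits = rng.integers(0, 2, size=N)  # hypothetical sensitive binary attribute
true_count = bits.sum()

# Laplace mechanism: a count query has sensitivity 1, so the noise scale is 1/eps.
lap_est = true_count + rng.laplace(scale=1.0 / eps, size=trials)

# Randomized response: each user reports truthfully with p = e^eps/(e^eps + 1)
# and flips otherwise; the aggregate count is then debiased.
p = np.exp(eps) / (np.exp(eps) + 1.0)
flips = rng.random((trials, N)) > p
noisy_counts = np.where(flips, 1 - bits, bits).sum(axis=1)
rr_est = (noisy_counts - N * (1 - p)) / (2 * p - 1)

print(f"Laplace mechanism std:   {lap_est.std():7.1f}  (~ 1/eps)")
print(f"Randomized response std: {rr_est.std():7.1f}  (~ sqrt(N)/eps)")
```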
“…After testing, the analyst can proceed with the real data (without "seeing" it). Synthetic data generation could rely on simple techniques (e.g., sampling from a normal distribution with the same mean and standard deviation as the target attribute), on ML [18,96,112], or on combining DP with either.…”
Section: Key System Desiderata
confidence: 99%
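
A minimal sketch of the "simple technique" this statement mentions, assuming purely numeric attributes (the function name is hypothetical): each column is sampled independently from a normal distribution fitted to the real column's mean and standard deviation, deliberately ignoring all correlations.

```python
import numpy as np
import pandas as pd

def naive_normal_synth(real: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Per-column Gaussian sampler: matches each attribute's empirical mean
    and standard deviation, but ignores all cross-column structure."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {c: rng.normal(real[c].mean(), real[c].std(), size=n) for c in real.columns}
    )
```

Such a baseline is useful precisely because it is trivial: a DP generator that cannot beat it on downstream tasks adds little value.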
“…Similarly to tools offering DP ML [39,52,74,90], we suggest developers package and include DP SDG logic. Guidance: [17,18,96,98,106,109,110,112]. Gap 4: (V) Visualization.…”
Section: Gaps In Differential Privacy Practice
confidence: 99%
“…Patki et al. pushed this further by distributing synthetic and real datasets randomly to teams of data scientists and evaluating whether teams working on real and synthetic datasets would arrive at approximately the same conclusions [106]. A similar approach was used by Tao et al. [107], where an XGBoost classifier is trained on synthetic data and evaluated on real data across a range of tabular datasets.…”
Section: Utility-driven Evaluation
confidence: 99%
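
The train-on-synthetic, test-on-real protocol described here can be sketched as follows (random placeholder arrays stand in for the synthetic and real tables; this assumes the `xgboost` package and is not the cited authors' code):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
# Placeholder features/labels standing in for synthetic and real tabular data.
X_synth, y_synth = rng.normal(size=(1000, 8)), rng.integers(0, 2, size=1000)
X_real, y_real = rng.normal(size=(500, 8)), rng.integers(0, 2, size=500)

# Train on synthetic, evaluate on real: if the generator preserved the
# predictive signal, this score approaches that of a model trained on real data.
clf = XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(X_synth, y_synth)
print("Train-synthetic/test-real accuracy:",
      accuracy_score(y_real, clf.predict(X_real)))
```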
“…A simple example is to focus on 1- and 2-way marginals of the data, which can be computed efficiently. The difference between these marginals can be estimated with a wide range of metrics: total variation distance [107], correlations and Cramér's V [107], or classical distances [108]. These metrics aim to capture whether the synthetic data preserves basic properties of the real data, such as histograms of individual attributes and relations between pairs of attributes.…”
Section: Fidelity-driven Evaluation
confidence: 99%
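
A sketch of the two marginal-based metrics named above, assuming categorical attributes (helper names are hypothetical): total variation distance compares 1-way marginals directly, while Cramér's V summarizes a 2-way contingency table, so fidelity can be judged by comparing V on a real attribute pair against V on the same pair in the synthetic data.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def tvd(real: pd.Series, synth: pd.Series) -> float:
    """Total variation distance between two 1-way (single-attribute) marginals."""
    cats = sorted(set(real) | set(synth))
    p = real.value_counts(normalize=True).reindex(cats, fill_value=0.0)
    q = synth.value_counts(normalize=True).reindex(cats, fill_value=0.0)
    return 0.5 * float(np.abs(p - q).sum())

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association strength for a pair of categorical attributes."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
real = pd.DataFrame({"a": rng.integers(0, 3, 1000), "b": rng.integers(0, 2, 1000)})
synth = pd.DataFrame({"a": rng.integers(0, 3, 1000), "b": rng.integers(0, 2, 1000)})
print("TVD on attribute a:", tvd(real["a"], synth["a"]))
print("Cramér's V gap on (a, b):",
      abs(cramers_v(real["a"], real["b"]) - cramers_v(synth["a"], synth["b"])))
```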