A Baseline for Attribute Disclosure Risk in Synthetic Data

Hittmeir, Markus; Mayer, Rudolf; Ekelhart, Andreas

doi:10.1145/3374664.3375722

Cited by 27 publications

(23 citation statements)

References 14 publications


“…Attribute Disclosure. This kind of privacy violation happens whenever access to data allows an attacker to learn new information about a specific individual [10], e.g., the value of a particular attribute like race, age, income, etc. Unfortunately, if the real data contains strong correlations between attributes, these correlations will likely be replicated in the synthetic data and available to the adversary [11].…”

Section: Risks Of Using Synthetic Data (mentioning)

confidence: 99%

What Is Synthetic Data? The Good, The Bad, and The Ugly

Cristofaro

2023

Preprint


Sharing data can often enable compelling applications and analytics. However, more often than not, valuable datasets contain information of sensitive nature, and thus sharing them can endanger the privacy of users and organizations. A possible alternative gaining momentum in the research community is to share synthetic data instead. The idea is to release artificially generated datasets that resemble the actual data, more precisely, having similar statistical properties.

So how do you generate synthetic data? What is that useful for? What are the benefits and the risks? What are the open research questions that remain unanswered? In this article, we provide a gentle introduction to synthetic data and discuss its use cases, the privacy challenges that are still unaddressed, and its inherent limitations as an effective privacy-enhancing technology.

How To Safely Release Data?

Before discussing synthetic data, let's first consider the "alternatives."

Anonymization: Theoretically, one could remove personally identifiable information before sharing it. However, in practice, anonymization fails to provide realistic privacy guarantees because a malevolent actor often has auxiliary information that allows them to re-identify anonymized data. For example, when Netflix de-identified movie rankings (as part of a challenge seeking better recommendation systems), Arvind Narayanan and Vitaly Shmatikov [1] de-anonymized a large chunk by cross-referencing them with public information on IMDb.

Aggregation: Another approach is to share aggregate statistics about a dataset. For example, telcos can provide statistics about how many people are in some specific locations at a given time, e.g., to assess footfall and decide where one should open a new store. However, this is often ineffective too [2, 3], as the aggregates can still help an adversary learn something about specific individuals.

Differential Privacy: More promising attempts come from providing access to statistics obtained from the data while adding noise to the queries' response, guaranteeing differential privacy [4]. However, this approach generally lowers the dataset's utility, especially on high-dimensional data. Additionally, allowing unlimited non-trivial queries on a dataset can reveal the whole dataset, so this approach needs to keep track of the privacy budget over time.

Types of Synthetic Data

There are different approaches to generating synthetic data. Derek Snow of the Alan Turing Institute lists three main methods:

1. Hand-engineered methods identify an underlying distribution from real data using expert opinion and seek to imitate it.

2. Agent-based models establish known agents and allow them to interact according to prescribed rules hoping that this interaction would ultimately amount to distribution profiles that look similar to the original dataset.


“…The construction of synthetic datasets and their utility metrics have become an exciting research problem. Further exploration of this avenue also compared the protection provided by fake data against conventional methods like k-anonymization (Hittmeir et al 2020). Recent findings showed that synthetic datasets having similar statistical properties as real data may offer privacy protection against inference attacks.…”

Section: Literature Survey (mentioning)

confidence: 99%

Sarve: synthetic data and local differential privacy for private frequency estimation

Varma, Chauhan, Singh

2022

Cybersecurity


The collection of user attributes by service providers is a double-edged sword. They are instrumental in driving statistical analysis to train more accurate predictive models like recommenders. The analysis of the collected user data includes frequency estimation for categorical attributes. Nonetheless, the users deserve privacy guarantees against inadvertent identity disclosures. Therefore algorithms called frequency oracles were developed to randomize or perturb user attributes and estimate the frequencies of their values. We propose Sarve, a frequency oracle that used Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) and Hadamard Response (HR) for randomization in combination with fake data. The design of a service-oriented architecture must consider two types of complexities, namely computational and communication. The functions of such systems aim to minimize the two complexities and therefore, the choice of privacy-enhancing methods must be a calculated decision. The variant of RAPPOR we had used was realized through bloom filters. A bloom filter is a memory-efficient data structure that offers time complexity of O(1). On the other hand, HR has been proven to give the best communication costs of the order of log(b) for b-bits communication. Therefore, Sarve is a step towards frequency oracles that exhibit how privacy provisions of existing methods can be combined with those of fake data to achieve statistical results comparable to the original data. Sarve also implemented an adaptive solution enhanced from the work of Arcolezi et al. The use of RAPPOR was found to provide better privacy-utility tradeoffs for specific privacy budgets in both high and general privacy regimes.


“…However, there are a growing number of technical definitions and assessments of privacy being introduced, that serve practitioners well to make the legal case. Two commonly used concepts within the context of synthetic data are empirical attribute disclosure assessments (Taub et al, 2018; Hittmeir et al, 2020), and Differential Privacy (Dwork et al, 2006). Both of these have proven to be useful in establishing trust in the safety of synthetic data, yet come with their own challenges in practice.…”

Section: Related Work (mentioning)

confidence: 99%

Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data

Platzer, Reutterer

2021

Front. Big Data


AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-sourced software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Measuring fidelity is based on statistical distances of lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating the individual-level distances to closest record with respect to the training data. By showing that the synthetic samples are just as close to the training as to the holdout data, we yield strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and compare these then to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup are made available open-source. The results highlight the need to systematically assess the fidelity just as well as the privacy of this emerging class of synthetic data generators.
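
The holdout comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' open-source implementation: it assumes purely numeric records and plain Euclidean distance (the abstract concerns mixed-type data, which would need a different distance), and the function names are hypothetical.

```python
import math


def distance_to_closest_record(point, records):
    """Euclidean distance from `point` to its nearest neighbour in `records`."""
    return min(math.dist(point, r) for r in records)


def share_closer_to_training(synthetic, training, holdout):
    """Fraction of synthetic records that lie closer to the training set
    than to the holdout set. Ties count as one half.

    Assumption for this sketch: training and holdout have comparable size
    and come from the same population.
    """
    score = 0.0
    for s in synthetic:
        d_train = distance_to_closest_record(s, training)
        d_hold = distance_to_closest_record(s, holdout)
        if d_train < d_hold:
            score += 1.0
        elif d_train == d_hold:
            score += 0.5
    return score / len(synthetic)


# Toy data, invented purely for illustration.
training = [(0.0, 0.0), (1.0, 1.0)]
holdout = [(0.5, 0.5), (2.0, 2.0)]

# A "synthesizer" that simply copies its training records.
leaky_synthetic = list(training)
print(share_closer_to_training(leaky_synthetic, training, holdout))
```

Under these assumptions, a share near 0.5 is consistent with the synthetic samples being just as close to the training data as to the holdout data, while a share well above 0.5 (here, copied training records) suggests that the generator depends on individual training records.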
