This explainer document provides an overview of the current state of the rapidly expanding work on synthetic data technologies, with a particular focus on privacy. The article is intended for a non-technical audience, though some formal definitions are given to provide clarity to specialists. It aims to enable the reader to quickly become familiar with the notion of synthetic data, and to understand some of the subtle intricacies that come with it. We believe that synthetic data is a very useful tool, and our hope is that this report highlights that, while drawing attention to nuances that can easily be overlooked in its deployment.

The following are the key messages that we hope to convey.

Synthetic data is a technology with significant promise. There are many applications of synthetic data: privacy, fairness, and data augmentation, to name a few. Each of these applications has the potential for tremendous impact, but each also comes with risks.

Synthetic data can accelerate development. Good-quality synthetic data can significantly accelerate data science projects and reduce the cost of the software development lifecycle. When combined with secure research environments and federated learning techniques, it contributes to data democratisation.

Synthetic data is not automatically private. A common misconception is that synthetic data is inherently private. This is not the case. Synthetic data can leak information about the data it was derived from and is vulnerable to privacy attacks. Significant care is required to produce synthetic data that is useful and comes with privacy guarantees.

Synthetic data is not a replacement for real data. Synthetic data that comes with privacy guarantees is necessarily a distorted version of the real data. Therefore, any modelling or inference performed on synthetic data carries additional risks. We believe that synthetic data should be used as a tool to accelerate the "research pipeline" but, ultimately, any final tools (those that will be deployed in the real world) should be evaluated, and if necessary fine-tuned, on the real data.

Outliers are hard to capture privately. Outliers and low-probability events, which are common in real data, are particularly difficult to capture and include in a synthetic dataset in a private way. For example, it would be very difficult to "hide" a multi-billionaire in synthetic data containing information about wealth: a synthetic data generator would either fail to accurately replicate statistics regarding the very wealthy or would reveal potentially private information about these individuals.

Empirically evaluating the privacy of a single dataset can be problematic. Rigorous notions of privacy (e.g. differential privacy) are a requirement on the mechanism that generated a synthetic dataset, rather than on the dataset itself. It is not possible to rigorously evaluate the privacy of a given synthetic dataset by directly comparing it with real data. Empirical evaluations can prove useful as t...
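For specialists, the differential privacy guarantee referred to in the last point can be stated precisely. The definition below is the standard textbook formulation (it is not spelled out in this document, but is well established in the literature): a randomised mechanism is differentially private if its output distribution is almost unchanged when any single individual's record is added or removed.

```latex
% Standard definition of epsilon-differential privacy (Dwork et al., 2006):
% for all pairs of datasets D, D' differing in one individual's record,
% and for all sets S of possible outputs,
\[
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S].
\]
```

Note that this is a property of the mechanism \(\mathcal{M}\), not of any single dataset it produces, which is exactly why the privacy of a given synthetic dataset cannot be rigorously evaluated in isolation.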
Although anonymous data are not considered personal data, recent research has shown how individuals can often be re-identified. Scholars have argued that previous findings apply only to small-scale datasets and that privacy is preserved in large-scale datasets. Using 3 months of location data, we (1) show that the risk of re-identification decreases slowly with dataset size, (2) approximate this decrease with a simple model taking into account three population-wide marginal distributions, and (3) prove that unicity is convex and obtain a linear lower bound. Our estimates show that 93% of people would be uniquely identified in a dataset of 60M people using four points of auxiliary information, with a lower bound at 22%. This lower bound increases to 87% when five points are available. Taken together, our results show how the privacy of individuals is very unlikely to be preserved even in country-scale location datasets.
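A minimal sketch of how the unicity statistic described above could be estimated empirically, assuming a pandas DataFrame of location records with hypothetical columns 'user', 'antenna' and 'hour'; the paper's exact procedure may differ:

```python
import random
import pandas as pd

def estimate_unicity(traces: pd.DataFrame, n_points: int = 4,
                     n_users: int = 1000, seed: int = 0) -> float:
    """Estimate unicity: the fraction of sampled users whose trace is the
    only one containing `n_points` (antenna, hour) pairs drawn from it."""
    rng = random.Random(seed)
    # One set of (antenna, hour) points per user.
    by_user = {u: set(map(tuple, g[["antenna", "hour"]].values))
               for u, g in traces.groupby("user")}
    users = rng.sample(sorted(by_user), min(n_users, len(by_user)))

    unique = 0
    for u in users:
        points = by_user[u]
        if len(points) < n_points:
            continue  # too few points to draw auxiliary information from
        aux = set(rng.sample(sorted(points), n_points))
        # The user is unique if no other trace also contains all aux points.
        matches = sum(1 for pts in by_user.values() if aux <= pts)
        unique += (matches == 1)
    return unique / len(users)
```

Scanning every trace for each sampled user is quadratic in the worst case; at the 60M-person scale studied above, an inverted index from points to users would be needed in practice.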
This work addresses the problem of learning from large collections of data with privacy guarantees. The sketched learning framework proposes to deal with the large scale of datasets by compressing them into a single vector of generalized random moments, from which the learning task is then performed. We modify the standard sketching mechanism to provide differential privacy, by adding Laplace noise combined with a subsampling mechanism (each moment is computed from a subset of the dataset). The data can be divided between several sensors, each applying the privacy-preserving mechanism locally, yielding a differentially private sketch of the whole dataset when the local sketches are combined. We apply this framework to the k-means clustering problem, providing a measure of the mechanism's utility in terms of a signal-to-noise ratio, and discuss the resulting privacy-utility tradeoff.
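A rough numpy illustration of the mechanism this abstract describes, combining per-moment subsampling with Laplace noise. The feature map, the sensitivity bound, and especially the privacy accounting below are simplifications (in particular, the budget is not composed across the m moments), so this is a sketch of the idea rather than the paper's calibrated mechanism:

```python
import numpy as np

def private_sketch(X: np.ndarray, m: int, epsilon: float,
                   subsample: float = 0.5, seed: int = 0) -> np.ndarray:
    """Toy differentially private sketch of X (n x d): m generalized random
    moments (random Fourier features), each averaged over its own random
    subsample of the data, with Laplace noise added per moment."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Omega = rng.normal(size=(d, m))  # random frequency vectors

    sketch = np.zeros(m, dtype=complex)
    for j in range(m):
        # Each moment is computed from its own random subset of the data.
        idx = np.flatnonzero(rng.random(n) < subsample)
        if idx.size == 0:
            idx = rng.integers(0, n, size=1)  # keep at least one record
        moment = np.exp(1j * (X[idx] @ Omega[:, j])).mean()
        # |exp(i<x, w>)| <= 1, so one record changes this average by at most
        # 2 / |idx| in each of its real and imaginary parts (a crude bound).
        scale = 2.0 / (idx.size * epsilon)
        noise = rng.laplace(scale=scale) + 1j * rng.laplace(scale=scale)
        sketch[j] = moment + noise
    return sketch
```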
From the "right to be left alone" to the "right to selective disclosure", privacy has long been thought as the control individuals have over the information they share and reveal about themselves. However, in a world that is more connected than ever, the choices of the people we interact with increasingly affect our privacy. This forces us to rethink our definition of privacy. We here formalize and study, as local and global node-and edge-observability, Bloustein's concept of group privacy. We prove edge-observability to be independent of the graph structure, while node-observability depends only on the degree distribution of the graph. We show on synthetic datasets that, for attacks spanning several hops such as those implemented by social networks and current US laws, the presence of hubs increases node-observability while a high clustering coefficient decreases it, at fixed density. We then study the edgeobservability of a large real-world mobile phone dataset over a month and show that, even under the restricted two-hops rule, compromising as little as 1% of the nodes leads to observing up to 46% of all communications in the network. More worrisome, we also show that on average 36% of each person's communications would be locally edge-observable under the same rule. Finally, we use real sensing data to show how people living in cities are vulnerable to distributed node-observability attacks. Using a smartphone app to compromise 1% of the population, an attacker could monitor the location of more than half of London's population. Taken together, our results show that the current individual-centric approach to privacy and data protection does not encompass the realities of modern life. This makes us-as a society-vulnerable to large-scale surveillance attacks which we need to develop protections against.
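A small networkx sketch of the kind of edge-observability measurement described above, under one plausible reading of the two-hops rule (a compromised node observes all communications of nodes within hops - 1 of it). The rule's exact definition and all names here are assumptions, not the paper's code:

```python
import random
import networkx as nx

def edge_observability(G: nx.Graph, frac_compromised: float = 0.01,
                       hops: int = 2, seed: int = 0) -> float:
    """Fraction of edges observed when frac_compromised of the nodes are
    compromised and each compromised node sees every edge incident to a
    node within (hops - 1) of it ('two-hops rule' when hops == 2)."""
    rng = random.Random(seed)
    k = max(1, int(frac_compromised * G.number_of_nodes()))
    compromised = rng.sample(list(G.nodes), k)

    observed = set()
    for c in compromised:
        # Nodes whose communications c can see under the rule.
        ball = nx.single_source_shortest_path_length(G, c, cutoff=hops - 1)
        for u in ball:
            for v in G.neighbors(u):
                observed.add(frozenset((u, v)))  # undirected edge
    return len(observed) / G.number_of_edges()
```

For instance, `edge_observability(nx.erdos_renyi_graph(10_000, 0.001))` measures the observed fraction on a random graph with 1% of nodes compromised.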
This work addresses the problem of learning from large collections of data with privacy guarantees. The compressive learning framework proposes to deal with the large scale of datasets by compressing them into a single vector of generalized random moments, called a sketch vector, from which the learning task is then performed. We provide sharp bounds on the so-called sensitivity of this sketching mechanism. This allows us to leverage standard techniques to ensure differential privacy, a well-established formalism for defining and quantifying the privacy of a random mechanism, by adding Laplace or Gaussian noise to the sketch. We combine these standard mechanisms with a new feature subsampling mechanism, which reduces the computational cost without damaging privacy. The overall framework is applied to the tasks of Gaussian modeling, k-means clustering and principal component analysis, for which sharp privacy bounds are derived. Empirically, the quality (for subsequent learning) of the compressed representation produced by our mechanism is strongly related to the induced noise level, for which we give analytical expressions.
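For context, these sensitivity bounds are what standard noise-addition mechanisms are calibrated against. The calibrations below are textbook results (Dwork and Roth, 2014), not formulas taken from this particular paper:

```latex
% Laplace mechanism: epsilon-DP for a function f with L1 sensitivity Delta_1,
% adding i.i.d. Laplace noise to each of the m coordinates of the sketch:
\[
\mathcal{M}(D) = f(D) + \mathrm{Lap}\!\left(\tfrac{\Delta_1}{\varepsilon}\right)^{m}
\]
% Gaussian mechanism: (epsilon, delta)-DP with L2 sensitivity Delta_2,
% for epsilon in (0, 1):
\[
\mathcal{M}(D) = f(D) + \mathcal{N}\!\left(0, \sigma^2 I_m\right),
\qquad
\sigma \ge \frac{\Delta_2 \sqrt{2 \ln(1.25/\delta)}}{\varepsilon}.
\]
```

Sharper sensitivity bounds therefore translate directly into less noise, and better downstream utility, for the same privacy budget.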