Generalizations on the quasi-identifiers can be expressed in terms of the example hierarchies as illustrated for the three quasi-identifiers in Figure 1.
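A minimal sketch of how generalization along a hierarchy might look in code. The three quasi-identifiers (age, postal code, date) and their levels are hypothetical stand-ins, not the hierarchies shown in Figure 1, and all function names are illustrative assumptions.

```python
from datetime import date

# Hypothetical hierarchies: level 0 is the raw value, and higher levels
# are coarser. These are NOT the hierarchies from Figure 1.

def generalize_age(age: int, level: int) -> str:
    """Age -> 5-year band -> 10-year band -> suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 5) * 5
        return f"{low}-{low + 4}"
    if level == 2:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"

def generalize_postal_code(code: str, level: int) -> str:
    """Mask one additional trailing character per level."""
    keep = max(len(code) - level, 0)
    return code[:keep] + "*" * (len(code) - keep)

def generalize_date(d: date, level: int) -> str:
    """Full date -> year and month -> year -> suppressed."""
    if level == 0:
        return d.isoformat()
    if level == 1:
        return f"{d.year:04d}-{d.month:02d}"
    if level == 2:
        return f"{d.year:04d}"
    return "*"

def generalize_record(record: dict, levels: dict) -> dict:
    """Apply the chosen hierarchy level to each quasi-identifier."""
    return {
        "age": generalize_age(record["age"], levels["age"]),
        "postal_code": generalize_postal_code(
            record["postal_code"], levels["postal_code"]
        ),
        "event_date": generalize_date(record["event_date"], levels["event_date"]),
    }

# Illustrative record (made-up values, not taken from any dataset).
example = {"age": 37, "postal_code": "98101", "event_date": date(2007, 3, 14)}
generalized = generalize_record(
    example, {"age": 1, "postal_code": 2, "event_date": 1}
)
```

Each quasi-identifier carries its own level, so a record can be coarsened on one attribute while another stays precise.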
BACKGROUND: While there has been growing interest in data synthesis as a way to share data for secondary analysis, there is a need for a comprehensive privacy risk model for fully synthetic data: if the generative models have been overfit, it is possible to identify individuals from synthetic data and learn something new about them.

OBJECTIVE: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data.

METHODS: A full risk model is presented that evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this meaningful identity disclosure risk. The model is applied to samples from the Washington state hospital discharge database (2007) and the Canadian COVID-19 cases database. Both datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data.

RESULTS: The meaningful identity disclosure risks for both synthesized samples were below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively) and were 5 times and 10 times lower than the risk values for the original datasets.

CONCLUSIONS: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on two datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of synthetic data.