2019
DOI: 10.1038/s41467-019-10933-3

Estimating the success of re-identifications in incomplete datasets using generative models

Abstract: While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains…

Cited by 544 publications (407 citation statements)
References 40 publications
“…Some data scientists have demonstrated that re-identification is highly probable in large datasets and suggest further technical solutions [46]. Rather than relying too heavily on de-identification, data protection must rely on a balance of information security and IG safeguards.…”
Section: Discussion (mentioning)
confidence: 99%
“…More recently, she demonstrated the ability to correctly identify 25% of research participants by name and 28% by address from data redacted beyond the HIPAA Safe Harbor standard [99]. Other authors have demonstrated the ability to re-identify at least 90% of Americans utilizing credit card metadata or via statistical models [96,100,101]. Given this emerging area of research, the need to systemically identify all stakeholders and potential data "owners" becomes increasingly essential in the identification of potential downstream security risks to users.…”
Section: Themes Data Transmission and Storage (mentioning)
confidence: 99%
“…Pseudonymization has its limitations (Gymrek et al , ; cf. Glossary ), and developments in machine learning and artificial intelligence already allow re‐identification of even small samples from anonymized data sets (Rocher et al , ). The likelihood of individual re‐identification from genomic data, whether coded or anonymized, is higher when such data have been linked with familial, sociodemographic, or audio‐visual information, as is often the case in rare diseases research (Thu Nguyen et al , ).…”
Section: Uncertainty Around Data Transfers Within the EU (mentioning)
confidence: 99%