A Globally Optimal k-Anonymity Method for the De-Identification of Health Data

Emam, Khaled El; Dankar, Fida K.; Issa, Romeo; Jonker, Elizabeth; Amyot, Daniel; Cogo, Elise; Corriveau, Jean-Pierre; Walker, Mark; Chowdhury, Sadrul; Vaillancourt, Régis; Roffey, Tyson; Bottomley, Jim

doi:10.1197/jamia.m3144

Cited by 198 publications

(178 citation statements)

References 33 publications

Supporting

Mentioning

168

Contrasting

“…it can be derived from its predecessor by incrementing the generalization level of exactly one attribute. The number of transformations in a generalization lattice grows exponentially with the number of attributes [15] and a wide variety of globally-optimal and heuristic search algorithms for generalization lattices have been proposed [15, 33, 34, 35]. In this article we will use the following notion. A generalization scheme is a function g :…”

Section: Solution Spaces and Search Strategies

mentioning

confidence: 99%
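The lattice structure described in the quoted passage can be illustrated with a minimal sketch. It assumes each transformation is a tuple holding one generalization level per attribute; the function names and the hierarchy heights in `max_levels` are hypothetical and are not taken from either paper.

```python
from itertools import product


def lattice_nodes(max_levels):
    """Enumerate every transformation: one generalization level per attribute."""
    return list(product(*(range(top + 1) for top in max_levels)))


def successors(node, max_levels):
    """Return the nodes reached by raising exactly one attribute's level by one."""
    result = []
    for i, (level, top) in enumerate(zip(node, max_levels)):
        if level < top:
            result.append(node[:i] + (level + 1,) + node[i + 1:])
    return result


# Hypothetical hierarchy heights for three quasi-identifiers.
max_levels = (2, 1, 3)
nodes = lattice_nodes(max_levels)
print(len(nodes))                         # (2+1) * (1+1) * (3+1) nodes
print(successors((0, 0, 0), max_levels))  # one successor per attribute
```

The node count is the product of (height + 1) over all attributes, so it grows exponentially as attributes are added. This is why the quoted passage points to dedicated search algorithms rather than exhaustive enumeration.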

SafePub: A Truthful Data Anonymization Algorithm With Strong Privacy Guarantees

Bild, Kuhn, Praßer 2018

Proceedings on Privacy Enhancing Technologies

Methods for privacy-preserving data publishing and analysis trade off privacy risks for individuals against the quality of output data. In this article, we present a data publishing algorithm that satisfies the differential privacy model. The transformations performed are truthful, which means that the algorithm does not perturb input data or generate synthetic output data. Instead, records are randomly drawn from the input dataset and the uniqueness of their features is reduced. This also offers an intuitive notion of privacy protection. Moreover, the approach is generic, as it can be parameterized with different objective functions to optimize its output towards different applications. We show this by integrating six well-known data quality models. We present an extensive analytical and experimental evaluation and a comparison with prior work. The results show that our algorithm is the first practical implementation of the described approach and that it can be used with reasonable privacy parameters resulting in high degrees of protection. Moreover, when parameterizing the generic method with an objective function quantifying the suitability of data for building statistical classifiers, we measured prediction accuracies that compare very well with results obtained using state-of-the-art differentially private classification algorithms.

“…El Emam et al (2009) discuss some of these information loss metrics in detail. Note that they are only useful in making decisions regarding recoding and suppression; they do not give the user/analyst any measure of data utility. The concept of k-anonymity drives several real world systems, including Datafly, k-Similar (Sweeney, 2002); Samarati, Incognito, and Optimal Lattice Anonymization (OLA; El Emam et al, 2009); and µ-argus (Hundepool et al, 2008). Most of these packages use local suppression in addition to global recoding to create a k-anonymous data set.…”

mentioning

confidence: 99%
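The combination of global recoding and suppression mentioned in the quoted passage can be sketched as follows. This is a toy illustration under stated assumptions, not OLA or any of the named packages: the quasi-identifiers (age, postal code), the recoding rule in `generalize`, and the sample records are hypothetical, and dropping whole records from small equivalence classes is used as a simple stand-in for local suppression.

```python
from collections import Counter


def generalize(record):
    """Global recoding: apply the same coarsening rule to every record."""
    age, postal_code = record
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", postal_code[:3] + "**")


def k_anonymize(records, k):
    """Recode all records, then suppress those in classes smaller than k."""
    recoded = [generalize(r) for r in records]
    class_sizes = Counter(recoded)
    kept = [r for r in recoded if class_sizes[r] >= k]
    suppressed = len(recoded) - len(kept)
    return kept, suppressed


# Hypothetical sample data: (age, postal code).
records = [(34, "K1A0B1"), (37, "K1A9Z9"), (31, "K1A3C3"), (52, "M5V2T6")]
kept, suppressed = k_anonymize(records, k=2)
print(kept)
print(suppressed)
```

Every record that remains shares its quasi-identifier values with at least k - 1 others. The number of suppressed records is one crude indicator of the information lost to reach that state.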

Avoiding Disclosure of Individually Identifiable Health Information

et al. 2011

Achieving data and information dissemination without harming anyone is a central task of any entity in charge of collecting data. In this article, the authors examine the literature on data and statistical confidentiality. Rather than comparing the theoretical properties of specific methods, they emphasize the main themes that emerge from the ongoing discussion among scientists regarding how best to achieve the appropriate balance between data protection, data utility, and data dissemination. They cover the literature on de-identification and reidentification methods with emphasis on health care data. The authors also discuss the benefits and limitations for the most common access methods. Although there is abundant theoretical and empirical research, their review reveals lack of consensus on fundamental questions for empirical practice: How to assess disclosure risk, how to choose among disclosure methods, how to assess reidentification risk, and how to measure utility loss.

Keywords: public use files, disclosure avoidance, reidentification, de-identification, data utility

SAGE Open

… inferential disclosure (i.e., information that can be inferred about a record in a data set with better accuracy). There is significant literature on each of these topics, which are beyond the scope of this article.

Our article is divided into six sections, of which this "Introduction" is the first. The second section presents "The Policy and Academic Context" surrounding the discussion. The third section discusses the state of the art in "De-Identification Methods," while the fourth emphasizes the state of the art in "Reidentification Methods." The fifth section presents the conclusions from the literature on the different ways in which users may "Access" public data, stressing the trade-offs between (a) confidentiality and utility and (b) confidentiality and ease of access. The last section presents the "Conclusion."

The Policy and Academic Context

Historic Perspective

Concerns about privacy and confidentiality in governmental efforts to collect and disseminate information are not new. As a review by Anderson and Seltzer (2009) suggests, "the roots of the modern concept of federal statistical confidentiality can be traced directly back to the late nineteenth century" (p. 8). Notwithstanding this history, the literature on statistical disclosure methods is fairly recent by modern standards (Dalenius, 1977, is considered the seminal paper). In 1975, the U.S. Federal Committee on Statistical Methodology (FCSM) was organized by the Office of Management and Budget (OMB) to investigate issues of data quality affecting federal statistics. As part of this effort, the Subcommittee on Disclosure Limitation Methodology, created within the FCSM, published its 1994 Statistical Policy Working Paper 22 (SPWP22). This paper, which was revised in 2005 by the Confidentiality and Data Access Committee (CDAC, 2005), sets good practice guidelines and recommendations for all agencies regarding confidentiality protection.

Defining Confidentiality ...

“…These include k-anonymity, assessing replicability of molecular data types, establishing formal access policies, implementing data use agreements and transparent informed consent procedures which specifically address future use of data, putting in place procedures for redress in the unlikely event of a data security breach, audits, and varying levels of access for personnel. 41,47 Although there is a public fear of individual reidentification through deidentified research data sets, it should be borne in mind that an attacker wishing to identify an individual still requires an identified DNA sample. Research data sets from biobanks may be a potential source of data for an attacker; however, the question has been raised as to why, other than to prove that it is possible, an attacker would use a sample to determine whether or not an individual's DNA was in a research dataset.…”

mentioning

confidence: 99%

Biobank networking for dissemination of data and resources: an overview

Meir, Cohen, Mee et al. 2014

BSAM

Abstract: In response to the increasing global demand for high quality biospecimens and data for biomedical research, biobanking is rapidly gaining popularity as an efficient and user-friendly platform for translational research. The advent of increasingly sophisticated technologies for specimen and data analysis, in the face of growing economic pressures, are converging to encourage consolidation, centralization, and harmonization of biobanks into networks. Several types of biobank networks exist worldwide. Individuals involved in network establishment and day-to-day function hail from varying disciplines, including health care, academia, information technology, and the pharmaceutical industry. However, they may work together within and between networks to enhance the rapid progression of patient/donor-centered research through standardization of procedures and robust quality management systems. Regularly updated standards, policies, and guidelines are published by large biobanking organizations and made available to biobankers and those interested in biospecimen science. A biobank network's ability to reliably disseminate specimens and data depends on a variety of factors, including a well stocked inventory, a robust information technology platform, and adequate support, including the goodwill of collectors who supply specimens, and of end-users who return experimental data to the network. High quality experimental data may be recycled, thus accelerating biomarker discovery. Access to large amounts of personal data, however, carries risk, and ethical issues surrounding data protection are of paramount importance. All biobank networks require data security measures in keeping with local ethical and legal requirements. Return of results to individual donors is another emerging ethical and administrative challenge for biobank networks as technology steadily increases the overlap between research and patient care. Finally, as the bioresource impact factor concept is further developed, and as more scientific journals require biospecimen and source details in submitted manuscripts, biobank networks will be securely established as an essential platform for biomedical research.

A Globally Optimal k-Anonymity Method for the De-Identification of Health Data

Abstract: For the de-identification of health datasets, OLA is an improvement on existing k-anonymity algorithms in terms of information loss and performance.
