Background
Privacy legislation in most jurisdictions allows the disclosure of health data for secondary purposes without patient consent if it is de-identified. Some recent articles in the medical, legal, and computer science literature have argued that de-identification methods do not provide sufficient protection because they are easy to reverse. Should this be the case, it would have significant implications for how health information is disclosed, including: (a) potentially limiting its availability for secondary purposes such as research, and (b) resulting in more identifiable health information being disclosed. Our objectives in this systematic review were to: (a) characterize known re-identification attacks on health data and contrast them with re-identification attacks on other kinds of data, (b) compute the overall proportion of records that have been correctly re-identified in these attacks, and (c) assess whether these attacks demonstrate weaknesses in current de-identification methods.

Methods and Findings
Searches were conducted in IEEE Xplore, ACM Digital Library, and PubMed. After screening, fourteen eligible articles representing distinct attacks were identified. On average, approximately a quarter of the records were re-identified across all studies (0.26, 95% CI 0.046–0.478), and 0.34 for attacks on health data (95% CI 0–0.744). There was considerable uncertainty around these proportions, as evidenced by the wide confidence intervals, and the mean proportion of records re-identified was sensitive to unpublished studies. Two of the fourteen attacks were performed on data that had been de-identified using existing standards. Only one of these attacks was on health data, and it had a success rate of 0.00013.

Conclusions
The current evidence shows a high re-identification rate but is dominated by small-scale studies on data that was not de-identified according to existing standards. This evidence is insufficient to draw conclusions about the efficacy of de-identification methods.
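The pooled proportions and confidence intervals above come from a meta-analysis across heterogeneous studies. As a rough illustration of how such a pooled estimate can be computed — using a standard DerSimonian–Laird random-effects approach, which is an assumption here, not necessarily the review's exact method — consider this minimal Python sketch. The study counts in the example are invented for illustration and are not the review's data.

```python
import math

def pooled_proportion(events, totals, z=1.96):
    """DerSimonian-Laird random-effects pooling of proportions.

    events[i] / totals[i] is the re-identification proportion in study i.
    Returns (pooled estimate, CI lower bound, CI upper bound).
    """
    # Continuity correction so zero- or all-event studies get a finite variance.
    p = [(x + 0.5) / (n + 1.0) if x in (0, n) else x / n
         for x, n in zip(events, totals)]
    v = [pi * (1 - pi) / n for pi, n in zip(p, totals)]
    w = [1.0 / vi for vi in v]

    # Fixed-effect estimate and Cochran's Q for between-study heterogeneity.
    p_fixed = sum(wi * pi for wi, pi in zip(w, p)) / sum(w)
    q = sum(wi * (pi - p_fixed) ** 2 for wi, pi in zip(w, p))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(p) - 1)) / c)

    # Random-effects weights add the between-study variance tau^2.
    w_re = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(wi * pi for wi, pi in zip(w_re, p)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, max(0.0, pooled - z * se), min(1.0, pooled + z * se)

# Hypothetical attack studies: (records re-identified, records attacked).
est, lo, hi = pooled_proportion([30, 2, 120, 8], [100, 50, 400, 200])
```

The wide intervals reported in the review fall out naturally from this kind of model: when study-level proportions disagree strongly, the between-study variance term inflates the standard error of the pooled estimate.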
Background
The Internet and social media offer promising ways to improve the reach, efficiency, and effectiveness of recruitment efforts at reasonable cost, but they raise unique ethical dilemmas. We describe how we used social media to recruit cancer patients and family caregivers for a research study, the ethical issues we encountered, and the strategies we developed to address them.

Objective
Drawing on the principles of Privacy by Design (PbD), a globally recognized standard for privacy protection, we aimed to develop a PbD framework for online health research recruitment.

Methods
We proposed a focus group study on the dietary behaviors of cancer patients and their families, and on the role of Web-based dietary self-management tools. Using an established blog on our hospital website, we proposed publishing a recruitment post and sharing the link on our Twitter and Facebook pages. The Research Ethics Board (REB) raised concerns about the privacy risks associated with our recruitment strategy: by clicking on a recruitment post, an individual could inadvertently disclose personal health information to third-party companies engaged in tracking online behavior. The REB asked us to revise our social media recruitment strategy with the following questions in mind: (1) How will you inform users about the potential for privacy breaches and their implications? and (2) How will you protect users from privacy breaches or from inadvertently sharing potentially identifying information about themselves?

Results
Ethical guidelines recommend a proportionate approach to ethics assessment, which advocates risk mitigation strategies that are proportional to the magnitude and probability of the risks. We revised our social media recruitment strategy to inform users about privacy risks and to protect their privacy, while still meeting our recruitment objectives. We provide a critical reflection on the perceived privacy risks associated with our social media recruitment strategy and on the appropriateness of the risk mitigation strategies we employed, assessing their alignment with PbD and discussing the following: (1) What are the potential risks, and who is at risk? (2) Is cancer considered “sensitive” personal information? (3) What is the probability of online disclosure of a cancer diagnosis in everyday life? and (4) What are the public’s expectations for privacy online, and what are their views about online tracking, profiling, and targeting? We conclude with a PbD framework for online health research recruitment.

Conclusions
Researchers, REBs, ethicists, students, and potential study participants are often unaware of the privacy risks of social media research recruitment, and there is no official guidance. Our PbD framework for online health research recruitment is a resource for these wide audiences.
Background
There is limited capacity to assess the comparative risks of medications after they enter the market. For rare adverse events, the pooling of data from multiple sources is necessary to have the statistical power and sufficient population heterogeneity to detect differences in safety and effectiveness in genetic, ethnic, and clinically defined subpopulations. However, combining datasets from different data custodians or jurisdictions to perform an analysis on the pooled data creates significant privacy concerns that would need to be addressed. Existing protocols for addressing these concerns can result in reduced analysis accuracy and can allow sensitive information to leak.

Objective
To develop a secure distributed multi-party computation protocol for logistic regression that provides strong privacy guarantees.

Methods
We developed a secure distributed logistic regression protocol using a single analysis center with multiple sites providing data. A theoretical security analysis demonstrates that the protocol is robust to plausible collusion attacks and does not allow the parties to gain new information from the data that are exchanged among them. The computational performance and accuracy of the protocol were evaluated on simulated datasets.

Results
The computational performance scales linearly as the dataset sizes increase. The addition of sites results in an exponential growth in computation time; however, for up to five sites, the time is still short and would not affect practical applications. The model parameters are the same as the results of a pooled raw data analysis in SAS, demonstrating high model accuracy.

Conclusion
The proposed protocol and prototype system would allow the development of logistic regression models in a secure manner without requiring the sharing of personal health information. This can alleviate one of the key barriers to the establishment of large-scale post-marketing surveillance programs. We extended the secure protocol to account for correlations among patients within sites through generalized estimating equations, and to accommodate other link functions by extending it to generalized linear models.
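The core idea behind distributed logistic regression with a single analysis center can be sketched without the cryptography: each site computes aggregate gradient and Hessian contributions on its own records, and the center sums them and takes Newton-Raphson steps. The sketch below illustrates only this statistical decomposition — the paper's actual protocol layers secure computation on top of these exchanges, which is omitted here — and all names and simulated data are illustrative assumptions.

```python
import numpy as np

def local_contributions(X, y, beta):
    """Per-site Newton-Raphson aggregates for logistic regression.
    In the sketch, only these aggregates (never raw records) leave a site."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))           # predicted probabilities
    grad = X.T @ (y - p)                          # local score vector
    hess = X.T @ (X * (p * (1 - p))[:, None])     # local information matrix
    return grad, hess

def fit_distributed(site_data, n_features, iters=25):
    """Analysis center: sum the site aggregates, then take a Newton step."""
    beta = np.zeros(n_features)
    for _ in range(iters):
        grads, hessians = zip(*(local_contributions(X, y, beta)
                                for X, y in site_data))
        beta = beta + np.linalg.solve(sum(hessians), sum(grads))
    return beta

# Simulated cohort split across three "sites" (intercept + 2 covariates).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(600), rng.normal(size=(600, 2))])
true_beta = np.array([0.5, 1.0, -1.0])
y = (rng.random(600) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)
sites = [(X[i::3], y[i::3]) for i in range(3)]

beta_dist = fit_distributed(sites, 3)
beta_pooled = fit_distributed([(X, y)], 3)  # same algorithm on pooled data
```

Because the score and information of a logistic model are sums over records, summing per-site aggregates reproduces the pooled-data fit exactly (up to floating point), which is consistent with the abstract's observation that the distributed parameters matched the pooled raw-data analysis.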
Background
There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data, and there is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy and will receive a US $3 million cash prize. The HHP began on April 4, 2011, and ends on April 3, 2013.

Objective
To de-identify the claims data used in the HHP competition and ensure that they meet the requirements of the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.

Methods
We defined a threshold risk, consistent with the HIPAA Privacy Rule Safe Harbor standard, for disclosing the competition dataset. Three plausible re-identification attacks that can be executed on these data were identified. For each attack, the re-identification probability was evaluated; if it was deemed too high, a new de-identification algorithm was applied to reduce the risk to an acceptable level. We performed an actual evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data can be re-identified given an attempted attack.

Results
An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was 0.0084, below the 0.05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions.

Conclusions
It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released to a global user community in support of an analytics competition. This is an example of, and a methodology for, achieving open data principles for longitudinal health data.
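The risk metric described — the probability that a record can be re-identified under an attack — is commonly operationalized through equivalence classes of quasi-identifiers: records sharing the same quasi-identifier values form a class, and an adversary matching on those values identifies a record in a class of size s with probability 1/s. As a minimal illustration of this style of calculation (not the authors' exact attack models, and with a made-up toy dataset):

```python
from collections import Counter

def reid_risk(records, quasi_identifiers):
    """Prosecutor-style re-identification risk from equivalence classes.

    Returns (maximum per-record risk, average risk per record).
    The average of 1/class_size over all records simplifies to
    (number of classes) / (number of records).
    """
    classes = Counter(tuple(r[q] for q in quasi_identifiers)
                      for r in records)
    max_risk = 1.0 / min(classes.values())
    avg_risk = len(classes) / len(records)
    return max_risk, avg_risk

# Toy dataset with three quasi-identifiers (invented for illustration).
people = [
    {"age": "30-39", "zip3": "100", "sex": "F"},
    {"age": "30-39", "zip3": "100", "sex": "F"},
    {"age": "30-39", "zip3": "100", "sex": "M"},
    {"age": "40-49", "zip3": "112", "sex": "M"},
]
max_r, avg_r = reid_risk(people, ["age", "zip3", "sex"])
# max_r == 1.0 (two singleton classes), avg_r == 0.75 (3 classes / 4 records)
```

In a workflow like the one the abstract describes, an estimate such as the 0.0084 figure would be compared against the pre-specified threshold (here 0.05), and further generalization or suppression would be applied whenever the estimate exceeded it.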