In healthcare, there is a vast amount of patients' data, which can lead to important discoveries if combined. Due to legal and ethical issues, such data cannot be shared and hence such information is underused. A new area of research has emerged, called privacy preserving data publishing (PPDP), which aims to share data in a way that preserves privacy while keeping information loss to a minimum. In this Letter, a new anonymisation algorithm for PPDP is proposed, which is based on k-anonymity through pattern-based multidimensional suppression (kPB-MS). The algorithm uses feature selection to reduce the data dimensionality and then combines attribute and record suppression to obtain k-anonymity. Five datasets from different areas of life sciences [RETINOPATHY, Single Photon Emission Computed Tomography imaging, gene sequencing and drug discovery (two datasets)] were anonymised with kPB-MS. The anonymised datasets were evaluated using four different classifiers and, in 74% of the test cases, they produced similar or better accuracies than the full datasets.
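The k-anonymity property targeted in the abstract above can be illustrated with a minimal sketch. This is not the kPB-MS algorithm itself: it shows only plain record suppression, where every record whose quasi-identifier pattern occurs fewer than k times is dropped. The function names, the field names (`age_band`, `sex`, `outcome`) and the toy records are hypothetical and chosen for illustration only.

```python
from collections import Counter


def pattern(record, quasi_identifiers):
    """Return the tuple of quasi-identifier values of one record."""
    return tuple(record[q] for q in quasi_identifiers)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier pattern occurs at least k times."""
    counts = Counter(pattern(r, quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


def suppress_records(records, quasi_identifiers, k):
    """Record suppression: drop records whose pattern occurs fewer than k times."""
    counts = Counter(pattern(r, quasi_identifiers) for r in records)
    return [r for r in records if counts[pattern(r, quasi_identifiers)] >= k]


# Hypothetical toy data; the last record has a unique pattern.
records = [
    {"age_band": "40-49", "sex": "F", "outcome": 1},
    {"age_band": "40-49", "sex": "F", "outcome": 0},
    {"age_band": "50-59", "sex": "M", "outcome": 1},
    {"age_band": "50-59", "sex": "M", "outcome": 1},
    {"age_band": "60-69", "sex": "F", "outcome": 0},
]
quasi_identifiers = ["age_band", "sex"]
anonymised = suppress_records(records, quasi_identifiers, k=2)
```

Attribute suppression can be mimicked in the same sketch by passing a shorter `quasi_identifiers` list, which usually leaves fewer records to drop at the cost of losing a column.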
Electronic Health Records (EHRs) contain an increasing wealth of medical information. They have the potential to help significantly in advancing medical research and in improving health policies, providing society with additional benefits. However, the European healthcare information space is fragmented due to the lack of legal and technical standards, cost-effective platforms, and sustainable business models. The vision of Linked2Safety is to advance clinical practice and accelerate medical research by providing pharmaceutical companies, healthcare professionals and patients with an innovative, secure semantic interoperability framework that facilitates efficient and homogenized access to anonymised distributed EHRs in an aggregate form, enabling multiple data sources to be merged into a single analysis. In this paper, a first public introduction to the project is provided, along with a clear definition of the problems and the proposed architecture. Three usage scenarios are used to demonstrate the potential impact of the outcomes of the project.
The key test for confidence in any association discovered within the medical domain is replication testing, that is, the ability of the association to be detected in independent populations. At the same time, to increase the likelihood of discovering statistically significant associations, there is a clear need to increase the statistical power of any given study. A key way to increase statistical power is to include as many subjects as possible that match a study's inclusion criteria. Thus, many have attempted to merge data from multiple independent sources, sites or studies that share the same inclusion criteria, creating a much larger study with significantly more statistical power. For these approaches to work, though, data from multiple sites need to be made available to a single analysis. This practice is significantly limited by the need to respect legal and ethical requirements that are often complicated, ambiguous and inconsistent across different countries. The common approach to merging data is to share aggregated data rather than subjects' personal data. Aggregated data, however, may in some cases still be reverse engineered. Therefore, cells with small values within the aggregated data have traditionally been suppressed, and some or all of the aggregated data perturbed, adding noise that inhibits attempts to identify personal information about a specific person or subgroup in the original data. In this paper we study the effects of cell suppression and perturbation on the results of the data analysis. Each approach is examined on its own as well as in combination, using the typical settings documented in the literature. The tests are based on a real dataset used to look for associations between phenotypes and genetic markers.
This work is part of the Linked2Safety project that aims to dynamically interconnect distributed patients' data to better enable medical research efforts, whilst respecting patients' anonymity, as well as European and national legislation.
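The two protections named in the abstract above, cell suppression and perturbation, can be sketched on a small table of aggregated counts. This is an illustrative sketch only: the threshold of 5, the uniform integer noise of at most 2, the function names and the genotype-by-status counts are all assumptions, not settings or data taken from the paper.

```python
import random


def suppress_small_cells(table, threshold):
    """Cell suppression: replace counts below the threshold with None."""
    return {cell: (None if count < threshold else count)
            for cell, count in table.items()}


def perturb_cells(table, max_noise, rng):
    """Perturbation: add uniform integer noise in [-max_noise, max_noise].

    Suppressed cells (None) stay suppressed; counts are clipped at zero.
    """
    noisy = {}
    for cell, count in table.items():
        if count is None:
            noisy[cell] = None
        else:
            noisy[cell] = max(0, count + rng.randint(-max_noise, max_noise))
    return noisy


# Hypothetical aggregated counts: (genotype, case/control status) -> subjects.
counts = {
    ("AA", "case"): 120, ("AA", "control"): 95,
    ("AG", "case"): 40, ("AG", "control"): 52,
    ("GG", "case"): 3, ("GG", "control"): 11,
}

rng = random.Random(0)  # seeded so the sketch is reproducible
suppressed = suppress_small_cells(counts, threshold=5)
released = perturb_cells(suppressed, max_noise=2, rng=rng)
```

Applying suppression first and perturbation second mirrors the combined setting described above; either function can also be used alone to compare the effect of each protection on a downstream analysis.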
Several machine learning techniques have been applied to find multi-loci associations between Single Nucleotide Polymorphisms (SNPs) and a disease. This paper investigates whether Self Organizing Maps (SOMs) can generate clusters associated with a disease based on the genetic patterns of subjects. A batch categorical SOM that can handle missing data was applied to Genome Wide Association (GWA) data on Multiple Sclerosis (MS). The association of the generated clusters with the disease was initially tested using Pearson's chi-square test, and the weights of the top clusters were then examined for SNP patterns. The results of the analyses reveal statistically significant associations between the generated clusters and the disease, indicating that SOMs can be used for multi-loci associations.
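The cluster-versus-disease test mentioned in the abstract above can be sketched as Pearson's chi-square statistic on a contingency table. This is a generic illustration, not the authors' pipeline: the function name and the counts (one cluster against all remaining subjects, split into cases and controls) are hypothetical, and the sketch assumes every row and column total is non-zero.

```python
def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts.

    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows = in cluster / not in cluster,
# columns = cases / controls.
table = [
    [30, 10],
    [20, 40],
]
statistic, dof = pearson_chi_square(table)
```

For this table the expected counts are 20, 20, 30 and 30, so the statistic is about 16.67 with 1 degree of freedom. A p-value would then be read from the chi-square distribution, for example with a statistics library.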