Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing

Rankin, Debbie; Black, Michaela; Bond, Raymond; Wallace, Jonathan; Mulvenna, Maurice; Epelde, Gorka

doi:10.2196/18910

Cited by 100 publications

(96 citation statements)

References 40 publications

(35 reference statements)

“…Moreover, the combination of the MIDAS developed GYDRA data preparation tool, alongside synthetic dataset generation strategies, can enable hospitals and healthcare providers, to: 1) refine and prepare their datasets (with the required metadata description), and; 2) share synthetically generated privacy-preserving datasets with the scientific community, that follow statistical patterns similar to the real data, and have proven to be reliable for training machine learning models [28]. These mechanisms would enable users to load a controlled dataset into the MIDAS platform and to develop in-house analytics, whilst simultaneously allowing the scientific community to develop AI models based on synthetic datasets that can later be fed back to the policy-makers through the MIDAS platform.…”

Section: Ingesting Useful Open Data Sources (mentioning)

confidence: 99%

Meaningful Big Data Integration for a Global COVID-19 Strategy

Costa

Grobelnik

Fuart

et al. 2020

IEEE Comput. Intell. Mag.

Self Cite

With the rapid spread of the COVID-19 pandemic, the novel Meaningful Integration of Data Analytics and Services (MIDAS) platform quickly demonstrates its value, relevance and transferability to this new global crisis. The MIDAS platform enables the connection of a large number of isolated heterogeneous data sources, and combines rich datasets including open and social data, ingesting and preparing these for the application of analytics, monitoring and research tools. These platforms will assist public health authorities in: (i) better understanding the disease and its impact; (ii) monitoring the different aspects of the evolution of the pandemic across a diverse range of groups; (iii) contributing to improved resilience against the impacts of this global crisis; and (iv) enhancing preparedness for future public health emergencies. The model of governance and ethical review, incorporated and defined within MIDAS, also addresses the complex privacy and ethical issues that the developing pandemic has highlighted, allowing oversight and scrutiny of more and richer data sources by users of the system.

“…Such a knowledge-based model depends on prior knowledge of the system, and how much we can intellect about it (Kim et al, 2017; Bonnéry et al, 2019). On one hand, theory-based modelling aims at understanding and offers interpretability, on the other when modelling complex systems, simplifications and assumptions are inevitable, leading to inaccuracies or reduced utility (Hand, 2019; Rankin et al, 2020). In fact, relying on population-level statistics does not produce models capable of reproducing heterogeneous health outcomes (Chen et al, 2019a).…”

Section: Synthetic Data (mentioning)

confidence: 99%

“…ehrGAN is developed for sequences of medical codes Che et al. It learns a transitional distribution, combining an Encoder-Decoder CNN (Rankin et al, 2020) with VCD . The ehrGAN generator is trained to decode a random vector mixed with the latent space representation of a real patient (See Panel 2).…”

Section: Semi-supervised Learning (mentioning)

confidence: 99%

Synthetic Observational Health Data with GANs: from slow adoption to a boom in medical research and ultimately digital twins?

Georges-Filteau

Cirillo

2020

Preprint

After being collected for patient care, Observational Health Data (OHD) can further benefit patient well-being by sustaining the development of health informatics and medical research. Vast potential is unexploited because of the fiercely private nature of patient-related data and regulations to protect it. Generative Adversarial Networks (GANs) have recently emerged as a groundbreaking way to learn generative models that produce realistic synthetic data. They have revolutionized practices in multiple domains such as self-driving cars, fraud detection, digital twin simulations in industrial sectors, and medical imaging. The digital twin concept could readily apply to modelling and quantifying disease progression. In addition, GANs possess many capabilities relevant to common problems in healthcare: lack of data, class imbalance, rare diseases, and preserving privacy. Unlocking open access to privacy-preserving OHD could be transformative for scientific research. In the midst of COVID-19, the healthcare system is facing unprecedented challenges, many of which are data related for the reasons stated above. Considering these facts, publications concerning GAN applied to OHD seemed to be severely lacking. To uncover the reasons for this slow adoption, we broadly reviewed the published literature on the subject. Our findings show that the properties of OHD were initially challenging for the existing GAN algorithms (unlike medical imaging, for which state-of-the-art models were directly transferable) and the evaluation of synthetic data lacked clear metrics. We find more publications on the subject than expected, starting slowly in 2017, and since then at an increasing rate. The difficulties of OHD remain, and we discuss issues relating to evaluation, consistency, benchmarking, data modelling, and reproducibility.

“…They bound their representational power to correlations intelligible to the modeler, being prone to obscure inaccuracies. SD generated by these models hits a ceiling of utility (Rankin et al, 2020). In the ML field, generative models learn an approximation of the multi-modal distribution, from which we can draw synthetic samples (Goodfellow et al, 2014).…”

Section: Synthetic Data (mentioning)

confidence: 99%

“…Having served its primary purpose, this wealth of detailed information can further benefit patient well-being by sustaining medical research and development. That is to say, improving the development life-cycle of Health Informatics (HI), the predictive accuracy of Machine Learning (ML) algorithms, or enabling discoveries in research on clinical decisions, triage decisions, inter-institution collaboration, and HI automation (Rudin et al, 2020; Rankin et al, 2020). Big health data is the underpinning of two prime objectives of precision medicine: individualization of patient interventions and inferring the workings of biological systems from high-level analysis (Capobianco, 2020).…”

Section: Introduction (mentioning)

confidence: 99%
