Machine learning using synthetic and real data: Similarity of evaluation metrics for different healthcare datasets and for different algorithms

Heyburn, Rachel; Bond, Raymond; Black, Michaela; Mulvenna, Maurice; Wallace, Jonathan; Rankin, Debbie; Cleland, Brian

doi:10.1142/9789813273238_0160

Cited by 24 publications

(15 citation statements)

References 5 publications


“…While a number of synthetic data generators have been developed, empirical evidence of their efficacy has not been fully explored. This work extends a preliminary study [18] and investigates whether fully synthetic data can preserve the hidden complex patterns supervised machine learning can uncover from real data and therefore whether it can be used as a valid alternative to real data when developing eHealth apps and health care policy making solutions. This will be achieved by experimenting with a range of open health care datasets.…”

Section: Introduction (mentioning)

confidence: 72%

Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing

Rankin, Black, Bond et al. 2020

JMIR Med Inform

Self Cite

102


Background: The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce.

Objective: This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.

Methods: A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed.

Results: A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.

Conclusions: The results of this study are promising with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
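The evaluation protocol described in the abstract above (train separately on real and on synthetic data, then test only on held-out real data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy two-class dataset, the `synthesize` function (a simple per-class Gaussian sampler standing in for a parametric generator, not any of the generators used in the study), and the 1-nearest-neighbor classifier. It only shows how the accuracy deviation between the two training regimes is obtained.

```python
import random
import statistics

def make_real_data(n, rng):
    """Toy two-class dataset: two features drawn from class-specific Gaussians."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 2.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows

def synthesize(rows, n, rng):
    """Hypothetical parametric generator: fit an independent Gaussian per
    feature and class, then sample n new records from the fitted model."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    labels = sorted(by_label)
    weights = [len(by_label[y]) for y in labels]
    params = {
        y: [(statistics.mean(col), statistics.pstdev(col)) for col in zip(*by_label[y])]
        for y in labels
    }
    out = []
    for _ in range(n):
        y = rng.choices(labels, weights=weights)[0]
        out.append(([rng.gauss(mu, sd) for mu, sd in params[y]], y))
    return out

def predict_1nn(train, x):
    """Return the label of the nearest training record (squared Euclidean distance)."""
    nearest = min(train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x)))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose label the 1-NN classifier predicts correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

rng = random.Random(0)
real = make_real_data(300, rng)
real_train, real_test = real[:200], real[200:]
synthetic_train = synthesize(real_train, len(real_train), rng)

acc_real = accuracy(real_train, real_test)        # train on real, test on real
acc_synth = accuracy(synthetic_train, real_test)  # train on synthetic, test on real
deviation = acc_real - acc_synth                  # positive when synthetic training is worse
print(f"real: {acc_real:.3f}  synthetic: {acc_synth:.3f}  deviation: {deviation:+.3f}")
```

Repeating this per dataset and per classifier, and recording which classifier scores highest under each training regime, would give the kind of "winning classifier" comparison the abstract reports.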



“…Narrow or specific measures are widely used for assessing synthetic data [15], [19], [20], [27], [31], [32]. They are useful when the analysis to be performed on the synthetic data is known ahead of time.…”

Section: B Utility Metrics: Overview and Classification (mentioning)

confidence: 99%

“…We chose classification as it is a popular tool for synthetic data evaluation. On the other hand, one of the objectives of this investigation is to evaluate whether the other three dimensions of quality are good predictors of application-level fidelity [15], [19], [20], [44].…”

Section: Application Fidelity (mentioning)

confidence: 99%

“…Few research papers tried to investigate the utility of synthetic data generators [15]-[20]. They do so either by measuring a chosen statistical distance between the original and synthesized datasets [16], [21], or, more commonly, by measuring the differences in specific models between original and released data [15], [17], [19], [20]. The choice of the measures/models is guided by the application of interest and the provided conclusions apply to that specific context.…”

Section: Introduction (mentioning)

confidence: 99%


A Multi-Dimensional Evaluation of Synthetic Data Generators

2022


“…Such an evaluation involves comparing the performance metrics of predictive models trained on synthetic and on real data (called as model compatibility). This performance of a machine learning models trained and tested on real and or synthetic data is compared based on different scenarios [12,14,18]: Train on Real and Test on Synthetic data (TRTS), Train on Synthetic and Test on Real (TSTR), Train on Real, Test on Real (TRTR) and Train on Synthetic, Test on Synthetic (TSTS), and lastly trained and tested on a mixture of real and synthetic data (TMTM). In principle, these scenarios are transferable to the evaluation of synthetic data in recommender systems.…”

Section: Reliable Evaluation (mentioning)

confidence: 99%

Doing Data Right: How Lessons Learned Working with Conventional Data should Inform the Future of Synthetic Data for Recommender Systems

Slokom, Larson 2021

Preprint


We present a case that the newly emerging field of synthetic data in the area of recommender systems should prioritize 'doing data right'. We consider this catchphrase to have two aspects: First, we should not repeat the mistakes of the past, and, second, we should explore the full scope of opportunities presented by synthetic data as we move into the future. We argue that explicit attention to dataset design and description will help to avoid past mistakes with dataset bias and evaluation. In order to fully exploit the opportunities of synthetic data, we point out that researchers can investigate new areas such as using data synthesis to support reproducibility by making data open, as well as FAIR, and to push forward our understanding of data minimization.
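The single-source train/test scenarios named in the citation statement for this entry (train on real or synthetic, test on real or synthetic) can be enumerated mechanically. The sketch below is only an illustration under stated assumptions: the tiny hand-written 1-D datasets, the threshold classifier, and the helper names `fit_threshold`, `accuracy`, and `evaluate_scenarios` are all hypothetical and come from neither cited paper. The mixed scenario (train and test on a mixture) is left out for brevity.

```python
from statistics import mean

def fit_threshold(rows):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    m0 = mean(x for x, y in rows if y == 0)
    m1 = mean(x for x, y in rows if y == 1)
    return (m0 + m1) / 2

def accuracy(threshold, rows):
    """Fraction of rows where 'x above threshold' agrees with the label."""
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

def evaluate_scenarios(real_train, real_test, synth_train, synth_test):
    """Score every train-source / test-source combination.
    Keys follow the pattern T<train>T<test>, e.g. TSTR = train synthetic, test real."""
    train_sets = {"R": real_train, "S": synth_train}
    test_sets = {"R": real_test, "S": synth_test}
    return {
        f"T{tr}T{te}": accuracy(fit_threshold(train_sets[tr]), test_sets[te])
        for tr in train_sets
        for te in test_sets
    }

# Hand-written illustrative records: (feature value, class label).
real_train = [(0.1, 0), (0.4, 0), (1.6, 1), (1.9, 1)]
real_test = [(0.3, 0), (0.8, 0), (1.2, 1), (1.1, 0)]
synth_train = [(0.5, 0), (0.9, 0), (2.1, 1), (2.5, 1)]
synth_test = [(0.6, 0), (1.2, 0), (2.0, 1), (2.4, 1)]

results = evaluate_scenarios(real_train, real_test, synth_train, synth_test)
for name, score in results.items():
    print(name, score)
```

Comparing TSTR against TRTR isolates how much is lost by training on synthetic data, which is the comparison the health care study in this report relies on.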
