2022
DOI: 10.2196/35734
Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study

Abstract: Background: A regular task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods. Objective: This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. …

Cited by 29 publications (17 citation statements)
References 51 publications
“…This is appealing given that previous results have shown that sequential synthesis can have good utility for oncology clinical trial data [11] and for observational datasets [82].…”
Section: Discussion
Confidence: 99%
“…For a specific dataset, it is possible to ensure that the membership disclosure risk is acceptably small by incorporating the metric into a risk-utility loss during hyperparameter tuning of the generative model while it is being trained. The following loss metric can be used: [equation not reproduced in this extract], where the utility term is some validated utility metric [82] and the conditions are Iverson brackets. This loss proportionally penalizes the utility if the membership disclosure is above the 0.2 threshold, using a sigmoid function.…”
Section: Discussion
Confidence: 99%
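The exact loss equation did not survive extraction, but the prose describes its shape: leave utility untouched while membership-disclosure risk stays at or below 0.2, and discount it via a sigmoid of the threshold excess otherwise. A minimal sketch under those assumptions (the function name, the `steepness` parameter, and the [0, 1] scaling of both inputs are illustrative, not the paper's definitions):

```python
import math

def risk_utility_loss(utility, membership_risk, threshold=0.2, steepness=50.0):
    """Sketch of a risk-utility loss for generative-model hyperparameter tuning.

    Assumes `utility` is a validated utility metric in [0, 1] and
    `membership_risk` a membership-disclosure estimate in [0, 1].
    The Iverson-bracket conditions from the citation are written as an
    explicit branch on the 0.2 threshold.
    """
    if membership_risk <= threshold:
        # [risk <= threshold]: no penalty; minimizing loss maximizes utility.
        return 1.0 - utility
    # [risk > threshold]: discount utility by a sigmoid of the excess risk,
    # so the penalty grows smoothly as the risk climbs past 0.2.
    penalty = 1.0 / (1.0 + math.exp(-steepness * (membership_risk - threshold)))
    return 1.0 - utility * (1.0 - penalty)
```

A tuner would minimize this loss over hyperparameter candidates, which prefers high-utility models but sharply disfavors any configuration whose disclosure risk crosses the threshold.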
“…This metric is bounded between 0 and 1 and hence is an easily interpreted generic measure of overall similarity of the multivariate distribution between the real and synthetic datasets. This metric has also been shown to be highly predictive of synthetic data utility for logistic regression analyses [95].…”
Section: Methods
Confidence: 99%
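The bounded similarity measure described here matches the form of a Hellinger-distance-based score. A univariate sketch for discrete (or pre-binned) data, assuming the names and the simple `1 - HD` scaling are illustrative rather than the cited paper's exact multivariate procedure:

```python
import math
from collections import Counter

def hellinger_similarity(real_values, synthetic_values):
    """Return 1 - Hellinger distance between two empirical distributions.

    Bounded between 0 and 1: 1 means identical category frequencies,
    0 means the two samples share no categories at all.
    """
    categories = set(real_values) | set(synthetic_values)
    p, q = Counter(real_values), Counter(synthetic_values)
    n_p, n_q = len(real_values), len(synthetic_values)
    # Hellinger distance: (1/sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2)
    hd = math.sqrt(
        sum((math.sqrt(p[c] / n_p) - math.sqrt(q[c] / n_q)) ** 2
            for c in categories)
    ) / math.sqrt(2)
    return 1.0 - hd
```

For multivariate comparison, such a per-variable score is typically averaged across variables (and variable pairs), which preserves the 0-to-1 bound and its interpretability.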
“…In our study, we further expand the approach by assessing whether the model trained on the original data is able to properly describe the synthetic data (condition D in the analysis of classification performance, subsection V-A). A recent study by El Emam et al [13] investigates the ability of a variety of utility metrics to evaluate 30 different health datasets and 3 different synthetic data generation methods, including Bayesian networks, GANs, and sequential tree synthesis. According to the authors, the Hellinger distance (HD) is the metric that best ranks the synthetic data generation methods based on prediction performance.…”
Section: Related Literature
Confidence: 99%
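The validation criterion referenced here, namely how well a utility metric's ranking of SDG methods agrees with their downstream prediction-performance ranking, is a rank-correlation comparison. A minimal sketch using Spearman's rho (no tied scores assumed; in practice a library routine such as `scipy.stats.spearmanr` would handle ties):

```python
def spearman_rho(metric_scores, performance_scores):
    """Spearman rank correlation between two score lists (no ties).

    +1 means the utility metric orders the SDG methods exactly as their
    prediction performance does; -1 means the orderings are reversed.
    """
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(metric_scores), ranks(performance_scores)
    n = len(metric_scores)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Classic formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))
```

A metric like HD "best ranks" the methods in this sense when its rho against the prediction-performance ordering is highest across the evaluated datasets.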