BACKGROUND A regular task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data; however, they have not been validated in general, nor for the specific purpose of comparing SDG methods. OBJECTIVE This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data to build logistic regression prediction models, which is a very frequent workload in health research. METHODS We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a generative adversarial network, and sequential tree synthesis). These metrics were computed by averaging across 20 synthetic data sets generated from the same generative model. The metrics were then tested on their ability to rank the SDG methods by prediction performance, defined as the difference in the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) between logistic regression prediction models built on synthetic data and those built on real data. RESULTS The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of the real and synthetic joint distributions. CONCLUSIONS This study validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can be used to evaluate and compare alternative SDG methods.
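The winning metric is a multivariate Hellinger distance computed on a Gaussian copula representation of the joint distributions. As a rough illustration only (not the authors' implementation), the sketch below assumes all variables are numeric, maps each column to normal scores through its empirical CDF, and uses the closed-form Hellinger distance between two zero-mean multivariate normals parameterized by the resulting correlation matrices.

```python
# Illustrative sketch, assuming numeric columns; categorical variables would
# need encoding first. Not the paper's exact implementation.
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_copula_corr(X):
    """Correlation matrix of the normal scores (the Gaussian copula parameter)."""
    n = X.shape[0]
    u = rankdata(X, axis=0) / (n + 1.0)   # empirical CDF mapped into (0, 1)
    z = norm.ppf(u)                       # normal scores
    return np.corrcoef(z, rowvar=False)

def copula_hellinger(real, synth):
    """Closed-form Hellinger distance between two zero-mean multivariate
    normals whose covariances are the copula correlation matrices."""
    s1, s2 = gaussian_copula_corr(real), gaussian_copula_corr(synth)
    sm = (s1 + s2) / 2.0
    _, ld1 = np.linalg.slogdet(s1)        # log-determinants for stability
    _, ld2 = np.linalg.slogdet(s2)
    _, ldm = np.linalg.slogdet(sm)
    bhattacharyya = np.exp(0.25 * ld1 + 0.25 * ld2 - 0.5 * ldm)
    return float(np.sqrt(max(0.0, 1.0 - bhattacharyya)))
```

A lower distance indicates that the synthetic joint distribution is closer to the real one; in the study, this value averaged over the 20 synthetic data sets is what ranks the SDG methods.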
Background One of the increasingly accepted ways to evaluate the privacy of synthetic data is to measure the risk of membership disclosure. This is the F1 accuracy with which an adversary can correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model, and it is commonly estimated using a data partitioning methodology with a partitioning parameter of 0.5. Objective To validate the membership disclosure F1 score, evaluate and improve the parameterization of the partitioning method, and provide a benchmark for its interpretation. Materials and methods We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the partitioning parameter that yields the same F1 score as a ground truth simulated membership disclosure attack. Results The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must equal the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets. Conclusions Our proposed parameterization, together with the interpretation benchmark and generative model training guidance, provides a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.
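For intuition, the sketch below shows one plausible form of such a simulated attack; the matching rule (counting quasi-identifier mismatches against a small threshold) and the function names are illustrative assumptions rather than the paper's exact procedure. Consistent with the result above, the attack dataset is assumed to contain training records in proportion to the sampling fraction of the real dataset from the population.

```python
# Illustrative membership disclosure attack scored as F1; the matching rule and
# threshold are assumptions, not the paper's exact method.
import numpy as np

def membership_f1(attack_qi, is_member, synth_qi, threshold=1):
    # attack_qi : (n, q) quasi-identifiers of the attack dataset, whose share of
    #             training records should match the real data's sampling fraction
    # is_member : (n,) ground-truth booleans, True if the record was in training
    # synth_qi  : (m, q) quasi-identifiers of the synthetic dataset
    claims = np.array([
        (synth_qi != row).sum(axis=1).min() <= threshold  # claim membership on a close match
        for row in attack_qi
    ])
    tp = np.sum(claims & is_member)
    precision = tp / max(claims.sum(), 1)
    recall = tp / max(is_member.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```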
Background: There is strong interest among researchers, the pharmaceutical industry, medical journal editors, research funders, and regulators in sharing clinical trial data. Reusing data extracts the most utility possible from patient contributions, and the majority of patients do want to share their data for secondary research purposes. However, data access for secondary analysis remains a challenge. A key reason why individual-level data is not made directly available to data users by authors and data custodians is concern over breaches of patient privacy. Synthetic data generation (SDG) is an effective way to address privacy concerns and can enable broader sharing of clinical trial datasets. However, a key question is whether results obtained on the generated data are reproducible enough to draw reliable conclusions. Methods: We synthesized datasets from five pragmatic breast cancer clinical trials performed by the REaCT group (https://react.ohri.ca/), using a sequential synthesis method, a type of machine learning. The published analysis of each trial was repeated on each synthetic dataset to evaluate reproducibility against three criteria: (a) decision agreement: the direction and statistical significance of the primary endpoint effect estimates are the same as in the real data; (b) estimate agreement: the parameter estimates from the synthetic data are within the 95% confidence interval of the real data; and (c) the confidence interval overlap between real and synthetic parameters is above 50%. In addition, we evaluated privacy using a membership disclosure metric, which quantifies the ability of an adversary to determine that a target individual was in the original dataset using the synthetic data, computed as an F1 classification accuracy score. Results: Decision and estimate agreement held across all five trials, and the confidence interval overlap was high. The risks of membership disclosure were all below the established 0.2 threshold. Conclusions: In this study, we successfully generated synthetic datasets that accurately replicated the original data from 5 oncology trials and yielded the same results as the original published studies, with a very low risk of membership disclosure. With proper modeling techniques, synthetic datasets can play a key role in data democratization and the reuse of oncology clinical trials.
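The three reproducibility criteria lend themselves to simple checks. The sketch below is a minimal illustration, assuming effect estimates are on a scale where 0 is the null (e.g., log hazard ratios) and using the commonly cited average confidence interval overlap formula; the original trial analyses may differ in detail.

```python
# Minimal sketch of the three reproducibility criteria, under the assumptions
# stated above; not the study's exact code.
def decision_agreement(real_est, real_p, synth_est, synth_p, alpha=0.05):
    """Same direction of effect and same statistical significance decision."""
    return (real_est * synth_est > 0) and ((real_p < alpha) == (synth_p < alpha))

def estimate_agreement(synth_est, real_lo, real_hi):
    """Synthetic estimate falls within the real data's 95% confidence interval."""
    return real_lo <= synth_est <= real_hi

def ci_overlap(real_lo, real_hi, synth_lo, synth_hi):
    """Average fraction of each interval covered by their intersection;
    the study required this to exceed 0.5."""
    inter = max(0.0, min(real_hi, synth_hi) - max(real_lo, synth_lo))
    return 0.5 * (inter / (real_hi - real_lo) + inter / (synth_hi - synth_lo))
```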