2022
DOI: 10.2196/35734
Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study

Abstract: Background: A regular task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated in general or for comparing SDG methods. Objective: This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. …

Cited by 29 publications (17 citation statements)
References 51 publications
“…This is appealing given that previous results have shown that sequential synthesis can have good utility for oncology clinical trial data [11] and for observational datasets [82].…”
Section: Discussion
Confidence: 99%
“…For a specific dataset, it is possible to ensure that the membership disclosure risk is acceptably small by incorporating the metric into a risk-utility loss during hyperparameter tuning of the generative model while it is being trained. The following loss metric can be used: [equation not reproduced in this extract], where the utility term is some validated utility metric [82] and the conditions are Iverson brackets. This loss proportionally penalizes the utility if the membership disclosure is above the 0.2 threshold, using a sigmoid function.…”
Section: Discussion
Confidence: 99%
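The exact loss equation did not survive extraction, but the prose describes its shape: leave utility untouched while membership-disclosure risk stays at or below 0.2, and discount it via a sigmoid of the threshold excess otherwise. A minimal sketch under those assumptions (the function name, the `steepness` parameter, and the [0, 1] scaling of both inputs are illustrative, not the paper's definitions):

```python
import math

def risk_utility_loss(utility, membership_risk, threshold=0.2, steepness=50.0):
    """Sketch of a risk-utility loss for generative-model hyperparameter tuning.

    Assumes `utility` is a validated utility metric in [0, 1] and
    `membership_risk` a membership-disclosure estimate in [0, 1].
    The Iverson-bracket conditions from the citation are written as an
    explicit branch on the 0.2 threshold.
    """
    if membership_risk <= threshold:
        # [risk <= threshold]: no penalty; minimizing loss maximizes utility.
        return 1.0 - utility
    # [risk > threshold]: discount utility by a sigmoid of the excess risk,
    # so the penalty grows smoothly as the risk climbs past 0.2.
    penalty = 1.0 / (1.0 + math.exp(-steepness * (membership_risk - threshold)))
    return 1.0 - utility * (1.0 - penalty)
```

A tuner would minimize this loss over hyperparameter candidates, which prefers high-utility models but sharply disfavors any configuration whose disclosure risk crosses the threshold.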
“…This metric is bounded between 0 and 1 and hence is an easily interpreted generic measure of overall similarity of the multivariate distribution between the real and synthetic datasets. This metric has also been shown to be highly predictive of synthetic data utility for logistic regression analyses [95].…”
Section: Methods
Confidence: 99%
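The bounded similarity measure described here matches the form of a Hellinger-distance-based score. A univariate sketch for discrete (or pre-binned) data, assuming the names and the simple `1 - HD` scaling are illustrative rather than the cited paper's exact multivariate procedure:

```python
import math
from collections import Counter

def hellinger_similarity(real_values, synthetic_values):
    """Return 1 - Hellinger distance between two empirical distributions.

    Bounded between 0 and 1: 1 means identical category frequencies,
    0 means the two samples share no categories at all.
    """
    categories = set(real_values) | set(synthetic_values)
    p, q = Counter(real_values), Counter(synthetic_values)
    n_p, n_q = len(real_values), len(synthetic_values)
    # Hellinger distance: (1/sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2)
    hd = math.sqrt(
        sum((math.sqrt(p[c] / n_p) - math.sqrt(q[c] / n_q)) ** 2
            for c in categories)
    ) / math.sqrt(2)
    return 1.0 - hd
```

For multivariate comparison, such a per-variable score is typically averaged across variables (and variable pairs), which preserves the 0-to-1 bound and its interpretability.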
“…In our study, we further expand the approach by assessing whether the model trained on the original data is able to properly describe the synthetic data (condition D in the analysis of classification performance, subsection V-A). A recent study by El Emam et al [13] investigates the ability of a variety of utility metrics to evaluate 30 different health datasets and 3 different synthetic data generation methods, including Bayesian networks, GANs, and sequential tree synthesis. According to the authors, the Hellinger distance (HD) is the metric that best ranks the synthetic data generation methods based on prediction performance.…”
Section: Related Literature
Confidence: 99%
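The validation criterion referenced here, namely how well a utility metric's ranking of SDG methods agrees with their downstream prediction-performance ranking, is a rank-correlation comparison. A minimal sketch using Spearman's rho (no tied scores assumed; in practice a library routine such as `scipy.stats.spearmanr` would handle ties):

```python
def spearman_rho(metric_scores, performance_scores):
    """Spearman rank correlation between two score lists (no ties).

    +1 means the utility metric orders the SDG methods exactly as their
    prediction performance does; -1 means the orderings are reversed.
    """
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(metric_scores), ranks(performance_scores)
    n = len(metric_scores)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # Classic formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))
```

A metric like HD "best ranks" the methods in this sense when its rho against the prediction-performance ordering is highest across the evaluated datasets.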