We develop metrics for measuring the quality of synthetic health data for both education and research. We use novel and existing metrics to capture a synthetic dataset's resemblance, privacy, utility and footprint. Using these metrics, we develop an end-to-end workflow based on our generative adversarial network (GAN) method, HealthGAN, that creates privacy preserving synthetic health data. Our workflow meets privacy specifications of our data partner: (1) the HealthGAN is trained inside a secure environment; (2) the HealthGAN model is used outside of the secure environment by external users to generate synthetic data. This second step facilitates data handling for external users by avoiding de-identification, which may require special user training, be costly, or cause loss of data fidelity. This workflow is compared against five other baseline methods. While maintaining resemblance and utility comparable to other methods, HealthGAN provides the best privacy and footprint. We present two case studies in which our methodology was put to work in the classroom and research settings. We evaluate utility in the classroom through a data analysis challenge given to students and in research by replicating three different medical papers with synthetic data. Data, code, and the challenge that we organized for educational purposes are available.
We propose a novel bootstrap procedure for dependent data based on Generative Adversarial networks (GANs). We show that the dynamics of common stationary time series processes can be learned by GANs and demonstrate that GANs trained on a single sample path can be used to generate additional samples from the process. We find that temporal convolutional neural networks provide a suitable design for the generator and discriminator, and that convincing samples can be generated on the basis of a vector of iid normal noise. We demonstrate the finite sample properties of GAN sampling and the suggested bootstrap using simulations where we compare the performance to circular block bootstrapping in the case of resampling an AR(1) time series processes. We find that resampling using the GAN can outperform circular block bootstrapping in terms of empirical coverage. * Acknowledgements: The authors gratefully acknowledge support from the Google Tensorflow Research Cloud (TFRC). PyTorch code for this paper is available on request. We also thank Giovanni Mellace and Peter Sandholt Jensen for useful comments.
Generating synthetic data represents an attractive solution for creating open data, enabling health research and education while preserving patient privacy. We reproduce the research outcomes obtained on two previously published studies, which used private health data, using synthetic data generated with a method that we developed, called HealthGAN. We demonstrate the value of our methodology for generating and evaluating the quality and privacy of synthetic health data. The dataset are from OptumLabs R Data Warehouse (OLDW). The OLDW is accessed within a secure environment and doesn't allow exporting of patient level data of any type of data, real or synthetic, therefore the HealthGAN exports a privacy-preserving generator model instead. The studies examine questions related to comorbidites of Autism Spectrum Disorder (ASD) using medical records of children with ASD and matched patients without ASD. HealthGAN generates high quality synthetic data that produce similar results while preserving patient privacy. By creating synthetic versions of these datasets that maintain privacy and achieve a high level of resemblance and utility, we create valuable open health data assets for future research and education efforts.
Through a series of theses about the current academic landscape, this manifesto rejects the naturalized assumption that such phenomena as administrative bloat, student debt, segregation, and privatization in higher education are an inexorable fact of involuntary market forces. In closing, the authors encourage readers to transform higher education through collective action.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.