Simulation of Synthetic Complex Data: The R Package simPop

Templ, Matthias; Meindl, Bernhard; Kowarik, Alexander; Dupriez, Olivier

doi:10.18637/jss.v079.i10

Cited by 55 publications

(40 citation statements)

References 29 publications

“…Finally, we note that several open-source software packages exist for synthetic data generation. Recent examples include the R packages synthpop [30] and SimPop [31], the Python package DataSynthesizer [5], and the Java-based simulator Synthea [7].…”

Section: Related Work

mentioning

confidence: 99%

Generation and evaluation of synthetic patient data

Gonçalves

Ray

Soper

et al. 2020

BMC Med Res Methodol

Background: Machine learning (ML) has made a significant impact in medicine and cancer research; however, its impact in these areas has been undeniably slower and more limited than in other application domains. A major reason for this has been the lack of availability of patient data to the broader ML research community, in large part due to patient privacy protection concerns. High-quality, realistic, synthetic datasets can be leveraged to accelerate methodological developments in medicine. By and large, medical data is high dimensional and often categorical. These characteristics pose multiple modeling challenges. Methods: In this paper, we evaluate three classes of synthetic data generation approaches; probabilistic models, classification-based imputation models, and generative adversarial neural networks. Metrics for evaluating the quality of the generated synthetic datasets are presented and discussed. Results: While the results and discussions are broadly applicable to medical data, for demonstration purposes we generate synthetic datasets for cancer based on the publicly available cancer registry data from the Surveillance Epidemiology and End Results (SEER) program. Specifically, our cohort consists of breast, respiratory, and non-solid cancer cases diagnosed between 2010 and 2015, which includes over 360,000 individual cases. Conclusions: We discuss the trade-offs of the different methods and metrics, providing guidance on considerations for the generation and usage of medical synthetic data.

“…We select for this simulation study the adults of eusilcP data set (population size 47,123), a population available from the R-packages simFrame [38] and simPop [39]. As a panel, we draw a sample of 1500 households stratified by region and select all members in each household.…”

Section: Simulation Study

mentioning

confidence: 99%

Treating Nonresponse in Probability-Based Online Panels through Calibration: Empirical Evidence from a Survey of Political Decision-Making Procedures

2020

The use of probability-based panels that collect data via online or mixed-mode surveys has increased in the last few years as an answer to the growing concern with the quality of the data obtained with traditional survey modes. However, in order to adequately represent the general population, these tools must address the same sources of bias that affect other survey-based designs: namely undercoverage and nonresponse. In this work, we test several approaches to produce calibration estimators that are suitable for survey data affected by nonresponse where auxiliary information exists at both the panel level and the population level. The first approach adjusts the results obtained in the cross-sectional survey to the population totals, while in the second, the weights are the result of a two-step process in which different adjustments are made at the sample, panel, and population levels. A simulation study of the properties of these estimators is performed. In light of theory and simulation results, we conclude that weighting by calibration is an effective technique for the treatment of nonresponse bias when the response mechanism is missing at random. These techniques have also been applied to real data from the survey Andalusian Citizen Preferences for Political Decision-Making Procedures.

“…These manually created EA-level urban household type probabilities were multiplied by the predicted household type probability surfaces created in step 3 to create the final 100 m × 100 m household type probability surfaces. Fifth, we simulated a population of realistic households in Oshikoto using the 20% census microdata sample and multinomial logistic regression techniques proposed by Alfons and colleagues (2011) and operationalized by Templ and colleagues (2017) in the R simPop package [7,15]. In this approach, we first calculated the proportion of households to simulate per household-size, per stratum (defined by constituency and urban/rural boundary).…”

Section: Phase A: Predict Spatial Distribution Of Household Types

mentioning

confidence: 99%

“…Combinatorial optimization procedures, such as simulated annealing (SA) [13] or quota sampling [14], can also be used to prevent sub-optimal combinations of attributes in the simulated dataset. Templ and colleagues discuss a model-based approach to simulation of individual or household attributes with regression models, which they implement in an open-source software [15]. Agent-based models (ABMs) can also produce a realistic count of individuals, or "agents", along with key attributes and relationships [16,17].…”

mentioning

confidence: 99%

Linking Synthetic Populations to Household Geolocations: A Demonstration in Namibia

Thomson

Kools

Jochem

2018

Data

Whether evaluating gridded population dataset estimates (e.g., WorldPop, LandScan) or household survey sample designs, a population census linked to residential locations is needed. Geolocated census microdata, however, are almost never available and are thus best simulated. In this paper, we simulate a close-to-reality population of individuals nested in households geolocated to realistic building locations. Using the R simPop package and ArcGIS, multiple realizations of a geolocated synthetic population are derived from the Namibia 2011 census 20% microdata sample, Namibia census enumeration area boundaries, Namibia 2013 Demographic and Health Survey (DHS), and dozens of spatial covariates derived from publicly available datasets. Realistic household latitude-longitude coordinates are manually generated based on public satellite imagery. Simulated households are linked to latitude-longitude coordinates by identifying distinct household types with multivariate k-means analysis and modelling a probability surface for each household type using Random Forest machine learning methods. We simulate five realizations of a synthetic population in Namibia’s Oshikoto region, including demographic, socioeconomic, and outcome characteristics at the level of household, woman, and child. Comparisons of variables in the synthetic population were made with the 2011 census 20% sample and 2013 DHS data by primary sampling unit/enumeration area. We found that synthetic population variable distributions matched the observed data and followed expected spatial patterns. We outline a novel process to simulate a close-to-reality microdata census geolocated to realistic building locations in a low- or middle-income country setting to support spatial demographic research and survey methodological development while avoiding disclosure risk of individuals.
