2017
DOI: 10.18637/jss.v079.i10

Simulation of Synthetic Complex Data: The R Package simPop

Abstract: The production of synthetic datasets has been proposed as a statistical disclosure control solution to generate public use files out of protected data, and as a tool to create "augmented datasets" to serve as input for micro-simulation models. Synthetic data have become an important instrument for ex-ante assessments of policy impact. The performance and acceptability of such a tool relies heavily on the quality of the synthetic populations, i.e., on the statistical similarity between the synthetic and the tru…

Cited by 55 publications (40 citation statements)
References 29 publications
“…Finally, we note that several open-source software packages exist for synthetic data generation. Recent examples include the R packages synthpop [30] and SimPop [31], the Python package DataSynthesizer [5], and the Java-based simulator Synthea [7].…”
Section: Related Work (mentioning)
confidence: 99%
“…We select for this simulation study the adults of eusilcP data set (population size 47,123), a population available from the R-packages simFrame [38] and simPop [39]. As a panel, we draw a sample of 1500 households stratified by region and select all members in each household.…”
Section: Simulation Study (mentioning)
confidence: 99%
“…These manually created EA-level urban household type probabilities were multiplied by the predicted household type probability surfaces created in step 3 to create the final 100 m × 100 m household type probability surfaces. Fifth, we simulated a population of realistic households in Oshikoto using the 20% census microdata sample and multinomial logistic regression techniques proposed by Alfons and colleagues (2011) and operationalized by Templ and colleagues (2017) in the R simPop package [7,15]. In this approach, we first calculated the proportion of households to simulate per household-size, per stratum (defined by constituency and urban/rural boundary).…”
Section: Phase A: Predict Spatial Distribution Of Household Types (mentioning)
confidence: 99%
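
The first step named in the statement above, computing the proportion of households per household size within each stratum, can be illustrated with a minimal Python sketch. This is not the simPop API and not the citing authors' code: the function name `household_size_proportions`, the stratum labels, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def household_size_proportions(households):
    """Return {stratum: {household_size: proportion}}.

    `households` is an iterable of (stratum, household_size) pairs,
    one pair per household in the microdata sample (hypothetical layout).
    """
    # Count households of each size separately within every stratum.
    counts = defaultdict(Counter)
    for stratum, size in households:
        counts[stratum][size] += 1

    # Convert the counts to within-stratum proportions.
    proportions = {}
    for stratum, size_counts in counts.items():
        total = sum(size_counts.values())
        proportions[stratum] = {
            size: n / total for size, n in sorted(size_counts.items())
        }
    return proportions


# Hypothetical sample: a stratum here combines an area and an urban/rural flag.
sample = [
    ("A-urban", 1), ("A-urban", 2), ("A-urban", 2), ("A-urban", 4),
    ("A-rural", 3), ("A-rural", 3),
]
props = household_size_proportions(sample)
```

Proportions of this kind could then be scaled by the number of households to simulate in each stratum; how the citing study did this is not stated in the excerpt.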
“…Combinatorial optimization procedures, such as simulated annealing (SA) [13] or quota sampling [14], can also be used to prevent sub-optimal combinations of attributes in the simulated dataset. Templ and colleagues discuss a model-based approach to simulation of individual or household attributes with regression models, which they implement in an open-source software [15]. Agent-based models (ABMs) can also produce a realistic count of individuals, or "agents", along with key attributes and relationships [16,17].…”
mentioning
confidence: 99%