2015
DOI: 10.1016/j.spasta.2015.07.008

Bayesian marked point process modeling for generating fully synthetic public use data with point-referenced geography

Abstract: Many data stewards collect confidential data that include fine geography. When sharing these data with others, data stewards strive to disseminate data that are informative for a wide range of spatial and non-spatial analyses while simultaneously protecting the confidentiality of data subjects' identities and attributes. Typically, data stewards meet this challenge by coarsening the resolution of the released geography and, as needed, perturbing the confidential attributes. When done with high intensity, these…

Cited by 22 publications (38 citation statements)
References 40 publications (48 reference statements)
“…One approach to do so would be that of Quick et al. (2015), which uses log-Gaussian Cox processes (LGCPs) (Møller et al.) with an underlying spatial structure to model home addresses within a marked point process model. By doing so, the synthesizer of Quick et al. …”
Section: Discussion
confidence: 99%
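As a rough illustration of the LGCP-based location synthesis described in the statement above, the following sketch simulates point locations from a log-Gaussian Cox process on a gridded unit square. It is not the synthesizer of Quick et al. (2015); the grid resolution, mean, and covariance parameters are arbitrary assumptions.

```python
# Minimal sketch (not the authors' code): simulating synthetic point locations
# from a log-Gaussian Cox process (LGCP) on a unit-square grid.
# Grid size and Gaussian-process parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Grid of cell centroids over the unit square
m = 20                                   # cells per side (assumption)
xs = (np.arange(m) + 0.5) / m
cx, cy = np.meshgrid(xs, xs)
cells = np.column_stack([cx.ravel(), cy.ravel()])

# Gaussian random field on the grid with exponential covariance
mu, sigma2, phi = 2.0, 1.0, 0.2          # illustrative GP parameters
d = np.linalg.norm(cells[:, None, :] - cells[None, :, :], axis=-1)
cov = sigma2 * np.exp(-d / phi)
gp = rng.multivariate_normal(np.full(m * m, mu), cov)

# LGCP: the intensity surface is the exponentiated field;
# the expected count in a cell is intensity times cell area
cell_area = 1.0 / (m * m)
lam = np.exp(gp) * cell_area
counts = rng.poisson(lam)

# Scatter synthetic "home addresses" uniformly within their cells
pts = []
for (x0, y0), n in zip(cells, counts):
    if n > 0:
        jitter = (rng.random((n, 2)) - 0.5) / m
        pts.append([x0, y0] + jitter)
synthetic_locations = np.vstack(pts) if pts else np.empty((0, 2))
print(synthetic_locations.shape)
```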
“…For the data in distribution (1), this would not only require generating synthetic n × 1 vectors Y^{†(l)}, but it would also require constructing a model from which to generate a collection of synthetic locations S^{†(l)} and any other individual-level attributes. As described in Wang and Reiter (2012) and Quick et al. (2015), however, approaches for generating fully synthetic point-referenced data sets can be quite computationally burdensome. Thus, in some instances, it may be attractive to take a partially synthetic approach in which only a collection of values or variables are replaced with imputed values (e.g.…”
Section: Methods For Statistical Disclosure Avoidance
confidence: 99%
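The contrast drawn above between fully and partially synthetic data can be illustrated with a small sketch: the locations S are released unchanged while only the sensitive attribute Y is replaced by draws from a model fitted to the confidential data. This is a toy linear-model illustration under assumed data, not the method of Wang and Reiter (2012) or Quick et al. (2015).

```python
# Minimal sketch (illustrative, not the cited methodology): a partially
# synthetic release that keeps locations S fixed and replaces only the
# sensitive attribute Y with model-based imputations.
import numpy as np

rng = np.random.default_rng(1)

# Confidential data: locations S (kept) and a sensitive attribute Y (replaced)
n = 500
S = rng.random((n, 2))                               # point-referenced locations
beta_true = np.array([1.0, 2.0, -1.0])
X = np.column_stack([np.ones(n), S])                 # simple location-based predictors
Y = X @ beta_true + rng.normal(0.0, 0.5, n)

# Fit a linear synthesis model for Y given S (ordinary least squares)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ beta_hat
sigma_hat = resid.std(ddof=X.shape[1])

# Partially synthetic data set: original S, synthetic Y drawn from the fitted model
Y_syn = X @ beta_hat + rng.normal(0.0, sigma_hat, n)
release = np.column_stack([S, Y_syn])
print(release[:3])
```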
“…In order to evaluate both the disclosure risk and the utility of the suppressed data, we consider the use of synthetic data (e.g. Little; Kennickell; Reiter; Quick et al.). Specifically, we let $\theta = (\beta_0, Z^T, \phi^T, \sigma^2, \tau^2)^T$ and generate synthetic values, $Y_i^*$, using the posterior predictive distribution

$$
f(Y_i^* \mid Y) = \int f(Y_i^* \mid \theta, Y)\, f(\theta \mid Y)\, d\theta
= \int \mathrm{Pois}\!\left(Y_i^* \mid n_i \exp(\beta_0 + Z_i + \phi_i)\right) f(\beta_0, Z, \phi, \sigma^2, \tau^2 \mid Y)\, d\beta_0\, dZ\, d\phi\, d\sigma^2\, d\tau^2,
$$

based on the hierarchical model in .…”
Section: Methods
confidence: 99%
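A minimal sketch of the posterior predictive synthesis step in the quoted equation, assuming posterior draws of β0, Z, and φ are already available; here they are generated as placeholders rather than taken from an actual MCMC fit, and the array shapes are assumptions of this sketch.

```python
# Minimal sketch of posterior predictive synthesis for the Poisson model above:
# Y*_i ~ Pois(n_i * exp(beta0 + Z_i + phi_i)), averaging over posterior draws.
# The "posterior" samples below are placeholders; in practice they would come
# from an MCMC fit of the hierarchical model (an assumption of this sketch).
import numpy as np

rng = np.random.default_rng(2)

n_areas, n_draws = 50, 1000
n_i = rng.integers(100, 1000, size=n_areas)          # known exposures / population sizes

# Placeholder posterior draws of beta0, Z, phi (one row per draw)
beta0 = rng.normal(-4.0, 0.1, size=(n_draws, 1))
Z = rng.normal(0.0, 0.3, size=(n_draws, n_areas))
phi = rng.normal(0.0, 0.2, size=(n_draws, n_areas))

# For each posterior draw, sample synthetic counts from the Poisson likelihood;
# marginally, each row is a draw from the posterior predictive distribution.
rate = n_i * np.exp(beta0 + Z + phi)                 # shape (n_draws, n_areas)
Y_syn = rng.poisson(rate)

# One synthetic data set per row; several rows give multiply-imputed releases
print(Y_syn[0][:10])
```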
“…For example, a variance-covariance matrix might be used to generate new data that serve as a proxy for the original data. Research into 'spatial data synthesizers', such as Quick et al. (2015), would thus be enormously beneficial. The US Census Bureau has started publishing synthetic individual-level data based on highly sensitive administrative data from agencies like the IRS and the Social Security Administration (US Census Bureau, 2014).…”
Section: Reproducible Publications Using Workflow Models
confidence: 99%
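The variance-covariance idea mentioned above can be sketched as follows: synthetic records are drawn from a multivariate normal parameterized by the sample mean and covariance of the original data, so first and second moments are approximately preserved. This is a generic illustration of the proxy-data idea, not any particular spatial data synthesizer.

```python
# Minimal sketch of the covariance-based proxy idea: release synthetic records
# drawn from a multivariate normal with the original data's sample mean and
# variance-covariance matrix (an illustration, not a specific tool).
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a confidential numeric data matrix (rows = records)
original = rng.multivariate_normal([10.0, 50.0, 0.0],
                                   [[4.0, 1.0, 0.5],
                                    [1.0, 9.0, -0.3],
                                    [0.5, -0.3, 1.0]],
                                   size=200)

mean_hat = original.mean(axis=0)
cov_hat = np.cov(original, rowvar=False)

# Synthetic proxy data preserving the estimated mean and covariance structure
synthetic = rng.multivariate_normal(mean_hat, cov_hat, size=original.shape[0])
print(np.round(np.cov(synthetic, rowvar=False), 2))
```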