Protection against disclosure is a legal and ethical obligation for agencies releasing microdata files for public use. Consider a microdata sample of size n from a finite population of size N = n + λn, with λ > 0, such that each sample record contains two disjoint types of information: identifying categorical information and sensitive information. Any decision about releasing data is supported by the estimation of measures of disclosure risk, which are defined as discrete functionals of the number of sample records with a unique combination of values of identifying variables. The most common measure is arguably the number τ1 of sample unique records that are population uniques. In this paper, we first study nonparametric estimation of τ1 under the Poisson abundance model for sample records. We introduce a class of linear estimators of τ1 that are simple, computationally efficient and scalable to massive datasets, and we give uniform theoretical guarantees for them. In particular, we show that they provably estimate τ1 all of the way up to the sampling fraction (λ + 1)^{-1} ∝ (log n)^{-1}, with vanishing normalized mean-square error (NMSE) for large n. We then establish a lower bound for the minimax NMSE for the estimation of τ1, which allows us to show that: i) (λ + 1)^{-1} ∝ (log n)^{-1} is the smallest possible sampling fraction for consistently estimating τ1; ii) the NMSE of our estimators is near optimal, in the sense of matching the minimax lower bound, for large n. This is the main result of our paper, and it provides a rigorous answer to an open question about the feasibility of nonparametric estimation of τ1 under the Poisson abundance model and for a sampling fraction (λ + 1)^{-1} < 1/2.
Graphex processes resolve some pathologies in traditional random graph models, notably, providing models that are both projective and allow sparsity. Most of the literature on graphex processes studies them from a probabilistic point of view. Techniques for inferring the parameter of these processes (the so-called graphon) are still marginal; exceptions are a few papers considering parametric families of graphons. Nonparametric estimation remains unconsidered. In this paper, we propose estimators for a selected choice of functionals of the graphon. Our estimators originate from the subsampling theory for graphex processes, and hence can be seen as a form of bootstrap procedure.
In this article, we present some specific aspects of symmetric Gamma process mixtures for use in regression models. We propose a new Gibbs sampler for simulating the posterior and we establish adaptive posterior rates of convergence related to the Gaussian mean regression problem.
Given n samples from a population of individuals belonging to different species, what is the number U of hitherto unseen species that would be observed if λn new samples were collected? This is an important problem in many scientific endeavors, and it has been the subject of recent breakthrough studies leading to minimax near-optimal estimation of U and consistency all the way up to λ ≍ log n. These studies do not rely on assumptions on the underlying unknown distribution p of the population, and therefore, while they provide a theory in its greatest generality, worst-case distributions may severely hamper the estimation of U in concrete applications. Motivated by the ubiquitous power-law type distributions, which nowadays occur in many natural and social phenomena, in this paper we consider the problem of estimating U under the assumption that p has regularly varying tails of index α ∈ (0, 1). First, we introduce an estimator of U that is simple, linear in the sampling information, computationally efficient and scalable to massive datasets. Then, uniformly over the class of regularly varying tail distributions, we show that our estimator has the following provable guarantees: i) it is minimax near-optimal, up to a power of log n factor; ii) it is consistent all of the way up to log λ ≍ n^{α/2} / √(log n), and this range is the best possible. This work presents the first study on the estimation of the unseen under regularly varying tail distributions. Our results rely on a novel approach, of independent interest, which is based on Bayesian arguments under Poisson-Kingman priors for the unknown regularly varying tail p. A numerical illustration is presented for several synthetic and real data, showing that our method outperforms existing ones.
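As a point of reference for what "linear in the sampling information" can mean, the following is a minimal sketch of the classical Good-Toulmin estimator of U, which is a signed linear combination of the prevalences Φ_i (the number of species observed exactly i times). It is not the estimator proposed in the abstract above, whose form is not given here; the function names `prevalences` and `good_toulmin` are illustrative assumptions, and the estimator is only well behaved for λ ≤ 1.

```python
from collections import Counter


def prevalences(samples):
    """Return {i: number of species seen exactly i times} for a list of labels."""
    species_counts = Counter(samples)          # species -> frequency
    return Counter(species_counts.values())    # frequency -> number of species


def good_toulmin(samples, lam):
    """Classical Good-Toulmin estimate of the number of unseen species
    expected in lam * n further samples: sum_i (-1)^(i+1) * lam^i * Phi_i.

    Assumption: intended for 0 < lam <= 1; the variance grows quickly beyond that.
    """
    phi = prevalences(samples)
    return sum((-1) ** (i + 1) * lam ** i * count for i, count in phi.items())


if __name__ == "__main__":
    # Toy data: two singletons (b, d), one doubleton (a), one tripleton (c).
    toy = ["a", "a", "b", "c", "c", "c", "d"]
    print(good_toulmin(toy, 1.0))
```

The estimator depends on the data only through the prevalences, so it can be computed in a single pass over the sample, which is what makes estimators of this linear form scalable to large datasets.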
Graphex processes resolve some pathologies in traditional random graph models, notably, providing models that are both projective and allow sparsity. In a recent paper, Caron and Rousseau (2017) show that for a large class of graphex models, the sparsity behaviour is governed by a single parameter: the tail-index of the function (the graphon) that parameterizes the model. We propose an estimator for this parameter and quantify its risk. Our estimator is a simple, explicit function of the degrees of the observed graph. In many situations of practical interest, the risk decays polynomially in the size of the observed graph. We illustrate the importance of a good estimator for the tail-index through the graph analogue of the unseen species problem. We also derive the analogous results for the bipartite graphex processes.