The goal of statistical matching is to estimate a joint distribution when only samples from its marginals have been observed. The lack of joint observations on the variables of interest is the source of uncertainty about the joint population distribution function. In the present article, the notion of matching error is introduced and upper-bounded via an appropriate measure of uncertainty. Then, an estimate of the distribution function for the variables not jointly observed is constructed on the basis of a modification of the conditional independence assumption in the presence of logical constraints. The corresponding measure of uncertainty is estimated from sample data. Finally, a simulation study is performed, and an application to a real case is provided. Supplementary materials for this article are available online.
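As a rough illustration of the plain conditional independence assumption mentioned above (not the modified version with logical constraints that the article constructs, whose details the abstract does not give), the sketch below estimates p(x, y | z) as p(x | z) p(y | z) from two separate samples. It assumes discrete variables and a common variable z observed in both samples; the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_freqs(pairs):
    """Estimate p(value | z) from (z, value) pairs by relative frequencies."""
    counts = defaultdict(Counter)
    for z, value in pairs:
        counts[z][value] += 1
    return {
        z: {v: c / sum(cnt.values()) for v, c in cnt.items()}
        for z, cnt in counts.items()
    }


def cia_joint(sample_a, sample_b):
    """Estimate p(x, y | z) under conditional independence of x and y given z.

    sample_a holds (z, x) pairs and sample_b holds (z, y) pairs; x and y are
    never observed together (assumed setup, for illustration only).
    """
    p_x = conditional_freqs(sample_a)
    p_y = conditional_freqs(sample_b)
    joint = {}
    # Only values of z seen in both samples can be matched.
    for z in p_x.keys() & p_y.keys():
        joint[z] = {
            (x, y): px * py
            for x, px in p_x[z].items()
            for y, py in p_y[z].items()
        }
    return joint


# Hypothetical toy data: z = age class, x = income class, y = housing tenure.
sample_a = [("young", "low"), ("young", "low"), ("young", "high"), ("old", "high")]
sample_b = [("young", "rent"), ("young", "own"), ("old", "own"), ("old", "own")]
joint = cia_joint(sample_a, sample_b)
```

Here `joint["young"]` is a proper distribution over (x, y) pairs, but it is only one of many joint distributions compatible with the two samples, which is the uncertainty the article sets out to measure.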
Statistical matching consists of estimating the joint characteristics of two variables observed in two distinct and independent sample surveys, respectively. In a parametric setup, ranges of estimates for nonidentifiable parameters are the only estimable items, unless restrictive assumptions on the probabilistic relationship between the variables not jointly observed are imposed. These ranges correspond to the uncertainty due to the absence of joint observations on the pair of variables of interest. The aim of this paper is to analyze the uncertainty in statistical matching in a nonparametric setting. A measure of uncertainty is introduced and its properties are studied. This measure reflects the "intrinsic" association between the pair of variables: when the two samples provide the only available knowledge on the pair, it is constant and equal to 1/6, whatever the form of the marginal distribution functions of the two variables. The measure becomes useful for assessing the reduction of uncertainty brought by knowledge beyond the data themselves, as in the case of structural zeros. In this case the proposed measure detects how the introduction of further knowledge shrinks the intrinsic uncertainty from 1/6 to smaller values, zero being the case of no uncertainty. Sampling properties of the uncertainty measure and of the bounds of the uncertainty intervals are also proved.
Bayesian networks are particularly useful for dealing with high-dimensional statistical problems. They allow a reduction in the complexity of the phenomenon under study by representing joint relationships between a set of variables through conditional relationships between subsets of these variables. Following Thibaudeau and Winkler, we use Bayesian networks for imputing missing values. This method is introduced to deal with the problem of the consistency of imputed values: preservation of statistical relationships between variables ("statistical consistency") and preservation of logical constraints in data ("logical consistency"). We perform some experiments on a subset of anonymous individual records from the 1991 UK population census. Copyright 2004 Royal Statistical Society.