Summary
The statistical challenges in using big data for making valid statistical inference in the finite population have been well documented in literature. These challenges are due primarily to statistical bias arising from under‐coverage in the big data source to represent the population of interest and measurement errors in the variables available in the data set. By stratifying the population into a big data stratum and a missing data stratum, we can estimate the missing data stratum by using a fully responding probability sample and hence the population as a whole by using a data integration estimator. By expressing the data integration estimator as a regression estimator, we can handle measurement errors in the variables in big data and also in the probability sample. We also propose a fully nonparametric classification method for identifying the overlapping units and develop a bias‐corrected data integration estimator under misclassification errors. Finally, we develop a two‐step regression data integration estimator to deal with measurement errors in the probability sample. An advantage of the approach advocated in this paper is that we do not have to make unrealistic missing‐at‐random assumptions for the methods to work. The proposed method is applied to the real data example using 2015–2016 Australian Agricultural Census data.
Summary
In analysing big data for finite population inference, it is critical to adjust for the selection bias in the big data. In this paper, we propose two methods of reducing the selection bias associated with the big data sample. The first method uses a version of inverse sampling by incorporating auxiliary information from external sources, and the second one borrows the idea of data integration by combining the big data sample with an independent probability sample. Two simulation studies show that the proposed methods are unbiased and have better coverage rates than their alternatives. In addition, the proposed methods are easy to implement in practice.
Predictive mean matching imputation is popular for handling item nonresponse in survey sampling. In this article, we study the asymptotic properties of the predictive mean matching estimator of the population mean. For variance estimation, the conventional bootstrap inference for matching estimators with fixed matches has been shown to be invalid due to the nonsmoothness nature of the matching estimator. We propose asymptotically valid replication variance estimation. The key strategy is to construct repli-cates of the estimator directly based on linear terms, instead of individual records of variables. Extension to nearest neighbor imputation is also discussed. A simulation study confirms that the new procedure provides valid variance estimation.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.