Jae Kwang Kim scite author profile

Nonresponse is frequently encountered in empirical studies. When the response mechanism is missing not at random (MNAR), statistical inference using the observed data is quite challenging. Handling MNAR data often requires two model assumptions: one for the outcome and the other for the response propensity. Specifying these two models correctly is challenging, and the specifications are difficult to verify from the responses obtained. In this article, we propose a semiparametric maximum likelihood method for MNAR data, in the sense that a parametric assumption is used for the response propensity part of the model and a nonparametric model is used for the outcome part. The resulting analysis is more robust than the fully parametric approach. Some asymptotic properties of our estimators are derived. Results from a simulation study are also presented. The Canadian Journal of Statistics 45: 393–409; 2017 © 2017 Statistical Society of Canada

Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference

Kim, Tam. 2020. Int Statistical Rev

Summary: The statistical challenges in using big data to make valid statistical inferences about a finite population have been well documented in the literature. These challenges are due primarily to statistical bias arising from under-coverage of the population of interest by the big data source and from measurement errors in the variables available in the data set. By stratifying the population into a big data stratum and a missing data stratum, we can estimate the missing data stratum by using a fully responding probability sample, and hence the population as a whole by using a data integration estimator. By expressing the data integration estimator as a regression estimator, we can handle measurement errors in the variables in big data and also in the probability sample. We also propose a fully nonparametric classification method for identifying the overlapping units and develop a bias-corrected data integration estimator under misclassification errors. Finally, we develop a two-step regression data integration estimator to deal with measurement errors in the probability sample. An advantage of the approach advocated in this paper is that we do not have to make unrealistic missing-at-random assumptions for the methods to work. The proposed method is applied to a real data example using 2015–2016 Australian Agricultural Census data.
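The stratification idea in this summary can be illustrated with a minimal sketch. It assumes the simplest setting only: no measurement error, and a probability sample that records, for each sampled unit, whether that unit also appears in the big data source. The function name `data_integration_total` and its argument names are hypothetical, chosen for illustration, and the code is not taken from the paper.

```python
from typing import Sequence


def data_integration_total(
    y_big: Sequence[float],
    y_sample: Sequence[float],
    in_big: Sequence[bool],
    weights: Sequence[float],
) -> float:
    """Estimate a population total from a big data source plus a probability sample.

    y_big    : y values for every unit in the big data source
    y_sample : y values for the units in the probability sample
    in_big   : for each sampled unit, True if it also belongs to the big data source
    weights  : design weights of the sampled units (inverse inclusion probabilities)
    """
    if not (len(y_sample) == len(in_big) == len(weights)):
        raise ValueError("sample inputs must have equal length")
    # Big data stratum: y is observed for every unit, so its total is known exactly.
    big_total = sum(y_big)
    # Missing data stratum: weighted total over sampled units outside the big data.
    missing_total = sum(
        w * y for y, b, w in zip(y_sample, in_big, weights) if not b
    )
    return big_total + missing_total


# Toy population y = 1..6; the big data source covers only the units with y >= 4.
y_big = [4.0, 5.0, 6.0]
# A sample of two units, each with design weight 3 (n = 2 out of N = 6).
estimate = data_integration_total(
    y_big, y_sample=[1.0, 4.0], in_big=[False, True], weights=[3.0, 3.0]
)
print(estimate)  # 18.0
```

In this sketch the sampled unit that overlaps with the big data source contributes nothing to the second term, because its stratum total is already known. No model for how units enter the big data source is needed, which matches the summary's remark about avoiding missing-at-random assumptions.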

Sampling Techniques for Big Data Analysis

Kim, Wang. 2018. Int Statistical Rev

Summary: In analysing big data for finite population inference, it is critical to adjust for the selection bias in the big data. In this paper, we propose two methods of reducing the selection bias associated with the big data sample. The first method uses a version of inverse sampling by incorporating auxiliary information from external sources, and the second one borrows the idea of data integration by combining the big data sample with an independent probability sample. Two simulation studies show that the proposed methods are unbiased and have better coverage rates than their alternatives. In addition, the proposed methods are easy to implement in practice.

Statistical data integration in survey sampling: a review

Yang, Kim. 2020. Jpn J Stat Data Sci

Asymptotic theory and inference of predictive mean matching imputation using a superpopulation model framework

Yang, Kim. 2019. Scandinavian J Statistics

Predictive mean matching imputation is popular for handling item nonresponse in survey sampling. In this article, we study the asymptotic properties of the predictive mean matching estimator of the population mean. For variance estimation, the conventional bootstrap inference for matching estimators with fixed matches has been shown to be invalid due to the nonsmooth nature of the matching estimator. We propose asymptotically valid replication variance estimation. The key strategy is to construct replicates of the estimator directly based on linear terms, instead of individual records of variables. Extension to nearest neighbor imputation is also discussed. A simulation study confirms that the new procedure provides valid variance estimation.
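For readers unfamiliar with the technique, here is a minimal sketch of predictive mean matching with a single donor per nonrespondent. It assumes one covariate and a simple linear working model fitted by least squares to the respondents. The function names `fit_line` and `pmm_impute` are hypothetical, and the sketch does not cover the replication variance estimator proposed in the article.

```python
from typing import List, Optional, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(x: Sequence[float], y: Sequence[Optional[float]]) -> List[float]:
    """Fill missing y values (None) by predictive mean matching.

    Each nonrespondent receives the observed y of the respondent whose
    predicted mean is closest to its own predicted mean.
    """
    respondents = [i for i, yi in enumerate(y) if yi is not None]
    intercept, slope = fit_line(
        [x[i] for i in respondents], [y[i] for i in respondents]
    )
    predicted = [intercept + slope * xi for xi in x]
    completed = list(y)
    for i, yi in enumerate(y):
        if yi is None:
            # Nearest respondent in terms of predicted mean acts as the donor.
            donor = min(respondents, key=lambda j: abs(predicted[j] - predicted[i]))
            completed[i] = y[donor]
    return completed


x = [1.0, 2.0, 3.0, 4.0, 2.1]
y = [1.0, 2.5, 2.9, 4.2, None]  # the last unit is an item nonrespondent
completed = pmm_impute(x, y)
print(completed[4])  # 2.5, borrowed from the respondent with x = 2.0
```

The imputed value is always an observed value from a donor, not the model prediction itself. This makes the completed-data estimator a nonsmooth function of the data, which is the difficulty for bootstrap inference that the abstract refers to.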


Jae Kwang Kim

Semiparametric maximum likelihood estimation with data missing not at random
