Boosted Kernel Weighting – Using Statistical Learning to Improve Inference from Nonprobability Samples

Kern, Christoph; Li, Yan; Wang, Lingxiao

doi:10.1093/jssam/smaa028

Cited by 11 publications

(8 citation statements)

References 38 publications


“…Logistic models are often used to estimate the propensity to participate in the survey of each individual. In recent decades, numerous machine‐learning (ML) methods have been considered in the literature for the treatment of nonprobability samples and have proved to be more suitable for regression and classification than linear regression methods (Castro‐Martín et al., 2020; Chu & Beaumont, 2019; Ferri‐García & Rueda, 2020; Kern et al., 2020).…”

Section: Methods (mentioning)

confidence: 99%

Enhancing estimation methods for integrating probability and nonprobability survey samples with machine‐learning techniques. An application to a Survey on the impact of the COVID‐19 pandemic in Spain

et al. 2022


Web surveys have replaced face‐to‐face and computer‐assisted telephone interviewing (CATI) as the main mode of data collection in most countries. This trend was reinforced as a consequence of COVID‐19 pandemic‐related restrictions. However, this mode still faces significant limitations in obtaining probability‐based samples of the general population. For this reason, most web surveys rely on nonprobability survey designs. Whereas probability‐based designs continue to be the gold standard in survey sampling, nonprobability web surveys may still prove useful in some situations. For instance, when small subpopulations are the group under study and probability sampling is unlikely to meet sample size requirements, complementing a small probability sample with a larger nonprobability one may improve the efficiency of the estimates. Nonprobability samples may also be designed as a means of compensating for known biases in probability‐based web survey samples by purposely targeting respondent profiles that tend to be underrepresented in these surveys. This is the case in the Survey on the impact of the COVID‐19 pandemic in Spain (ESPACOV) that motivates this paper. In this paper, we propose a methodology for combining probability and nonprobability web‐based survey samples with the help of machine‐learning techniques. We then assess the efficiency of the resulting estimates by comparing them with other strategies that have been used before. Our simulation study and the application of the proposed estimation method to the second wave of the ESPACOV Survey allow us to conclude that this is the best option for reducing the biases observed in our data.


“…Many techniques, varying in sophistication, can be used to estimate the propensity score (22). Concretely, logistic regression is the most commonly used method for fitting the propensity score; in this case, Σ is taken to be the class of linear functions passed through the logistic activation.…”

Section: Propensity Score Reweighting (mentioning)

confidence: 99%

Universal adaptability: Target-independent inference that competes with propensity scoring

Kim, Kern, Goldwasser, et al. 2022

Proc. Natl. Acad. Sci. U.S.A.

Self Cite


The gold-standard approaches for gleaning statistically valid conclusions from data involve random sampling from the population. Collecting properly randomized data, however, can be challenging, so modern statistical methods, including propensity score reweighting, aim to enable valid inferences when random sampling is not feasible. We put forth an approach for making inferences based on available data from a source population that may differ in composition in unknown ways from an eventual target population. Whereas propensity scoring requires a separate estimation procedure for each different target population, we show how to build a single estimator, based on source data alone, that allows for efficient and accurate estimates on any downstream target data. We demonstrate, theoretically and empirically, that our target-independent approach to inference, which we dub “universal adaptability,” is competitive with target-specific approaches that rely on propensity scoring. Our approach builds on a surprising connection between the problem of inferences in unspecified target populations and the multicalibration problem, studied in the burgeoning field of algorithmic fairness. We show how the multicalibration framework can be employed to yield valid inferences from a single source population across a diverse set of target populations.
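
The propensity score reweighting described in the citation statement above can be illustrated with a minimal sketch. It assumes a source sample and a target sample that share covariates, fits a logistic model for membership in the target sample, and reweights the source units by the estimated odds. The function names (`fit_logistic`, `propensity_odds_weights`), the Newton solver, and the simulated data are illustrative assumptions and are not taken from the cited papers.

```python
import numpy as np


def fit_logistic(X, z, n_iter=25, ridge=1e-6):
    """Fit a logistic regression by Newton's method (intercept is added here).

    X: (n, p) covariates; z: (n,) 0/1 labels. Returns coefficients, intercept first.
    A tiny ridge term keeps the Hessian invertible; it is a numerical convenience.
    """
    A = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))
        grad = A.T @ (z - p) - ridge * beta
        hess = (A * (p * (1.0 - p))[:, None]).T @ A + ridge * np.eye(A.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def propensity_odds_weights(X_source, X_target):
    """Normalized weights for source units, proportional to the estimated odds
    of belonging to the target sample given the covariates."""
    X = np.vstack([X_source, X_target])
    z = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    beta = fit_logistic(X, z)
    A_src = np.column_stack([np.ones(len(X_source)), X_source])
    w = np.exp(A_src @ beta)  # p / (1 - p) equals exp(linear predictor)
    return w / w.sum()


# Simulated illustration (assumed data): the target covariate is shifted by +1,
# so the unweighted source mean of y is biased for the target mean.
rng = np.random.default_rng(0)
X_source = rng.normal(0.0, 1.0, size=(5000, 1))
X_target = rng.normal(1.0, 1.0, size=(5000, 1))
y_source = 2.0 * X_source[:, 0] + rng.normal(0.0, 1.0, size=5000)

weights = propensity_odds_weights(X_source, X_target)
print("unweighted source mean:", y_source.mean())
print("reweighted estimate of target mean:", np.sum(weights * y_source))
```

In this setup the logistic model is correctly specified by construction; with real data a more flexible learner may be needed, which is the motivation the citing papers give for machine-learning alternatives.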


“…Very recently, a kernel weighting approach has been proposed by , where the weighted estimator is proved to be consistent under a weak exchangeability condition. To further weaken the modeling assumptions, Kern et al. (2020) propose to use algorithmic tree-based methods, including random forests and gradient tree boosting, for estimating the PS in kernel weighting.…”

Section: Introduction (mentioning)

confidence: 99%

Robust and Efficient Bayesian Inference for Non-Probability Samples

Rafei, Elliott, Flannagan 2022

Preprint


The declining response rates in probability surveys along with the widespread availability of unstructured data have led to growing research into non-probability samples. Existing robust approaches are not well-developed for non-Gaussian outcomes and may perform poorly in the presence of influential pseudo-weights. Furthermore, their variance estimators lack a unified framework and often rely on asymptotic theory. To address these gaps, we propose an alternative Bayesian approach using a partially linear Gaussian process regression that utilizes a prediction model with a flexible function of the pseudo-inclusion probabilities to impute the outcome variable for the reference survey. By efficiency, we mean not only computational scalability but also superiority with respect to variance. We also show that Gaussian process regression behaves as a kernel matching technique based on the estimated propensity scores, which yields double robustness and lowers sensitivity to influential pseudo-weights. Using the simulated posterior predictive distribution, one can directly quantify the uncertainty of the proposed estimator and derive associated 95% credible intervals. We assess the repeated sampling properties of our method in two simulation studies. The application of this study deals with modeling count data with varying exposures under a non-probability sample setting.
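
The kernel weighting on estimated propensity scores mentioned in the citation statement above can be sketched as follows. This is a hedged illustration of the general idea only: each reference-survey unit's design weight is split among the nonprobability units in proportion to a kernel of the distance between their propensity scores. The function name `kernel_weights`, the Gaussian kernel, and the fixed bandwidth are assumptions made for the example; the scores themselves could come from logistic regression or from a tree-based learner such as gradient boosting, and the cited papers should be consulted for the exact estimator.

```python
import numpy as np


def kernel_weights(ps_survey, d_survey, ps_nonprob, bandwidth):
    """Pseudo-weights for nonprobability units via kernel smoothing.

    ps_survey:  (n_s,) propensity scores (e.g. on the logit scale) of survey units
    d_survey:   (n_s,) design weights of the survey units
    ps_nonprob: (n_c,) propensity scores of the nonprobability units
    bandwidth:  positive kernel bandwidth (assumed given; no selection rule here)
    """
    ps_survey = np.asarray(ps_survey, dtype=float)
    d_survey = np.asarray(d_survey, dtype=float)
    ps_nonprob = np.asarray(ps_nonprob, dtype=float)

    # Pairwise scaled distances, shape (n_s, n_c).
    diff = (ps_survey[:, None] - ps_nonprob[None, :]) / bandwidth
    log_k = -0.5 * diff**2
    # Subtract the row maximum before exponentiating so no row underflows to zero.
    log_k -= log_k.max(axis=1, keepdims=True)
    k = np.exp(log_k)

    # Each survey unit distributes its weight across nonprobability units.
    shares = k / k.sum(axis=1, keepdims=True)
    return d_survey @ shares


# Toy illustration with made-up scores and design weights.
w = kernel_weights(
    ps_survey=[0.0, 1.0],
    d_survey=[10.0, 20.0],
    ps_nonprob=[0.0, 1.0, 1.0],
    bandwidth=0.01,
)
print(w)
```

Because every survey unit's shares sum to one, the pseudo-weights add up to the total of the survey design weights, which is one reason this construction is less sensitive to a few extreme weights than direct inverse-propensity weighting.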
