We consider quantile estimation in a semi-supervised setting, where two data sets are available: (i) a small or moderate-sized labeled data set containing observations of a response and a set of possibly high dimensional covariates, and (ii) a much larger unlabeled data set in which only the covariates are observed. Such settings are increasingly relevant in modern studies involving large databases, where labeled data may be limited due to practical constraints but unlabeled data are plentiful, and it is of interest to investigate how the latter may be exploited. We propose a family of semi-supervised estimators of the response quantile(s) based on the two data sets, aiming to improve estimation accuracy over the supervised estimator, i.e., the sample quantile, which uses the labeled data only. These estimators combine a flexible imputation strategy applied to the estimating equation with a debiasing step that provides full robustness against misspecification of the imputation model. Further, a one-step update strategy is adopted to enable easy implementation of our method and to handle the inevitable complexity arising from the non-linear nature of the quantile estimating equation. Under fairly mild assumptions, we prove that our estimators are fully robust to the choice of the nuisance imputation model, in the sense of always maintaining root-n consistency and asymptotic normality, while having improved efficiency relative to the supervised estimator. They also achieve semi-parametric optimality, provided the relation between the response and the covariates is correctly specified via the imputation model.
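The construction described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it uses a deliberately crude linear-probability imputation model for P(Y ≤ θ | X) and a Gaussian kernel density estimate for the one-step Newton update; all variable names and model choices are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.5
n, N = 200, 20000

# Labeled data: response and covariate; unlabeled data: covariate only.
X_lab = rng.normal(size=n)
Y_lab = 2.0 * X_lab + rng.normal(size=n)
X_unl = rng.normal(size=N)

# Supervised estimator: the sample quantile.
theta_sup = np.quantile(Y_lab, tau)

# Imputation model m(x) for P(Y <= theta_sup | X = x): a crude
# linear-probability fit. The debiasing step below is what keeps the
# estimator root-n consistent even when this model is misspecified.
Z = (Y_lab <= theta_sup).astype(float)
slope, intercept = np.polyfit(X_lab, Z, 1)
m_lab = np.clip(intercept + slope * X_lab, 0.0, 1.0)
m_unl = np.clip(intercept + slope * X_unl, 0.0, 1.0)

# Debiased, imputed estimating equation evaluated at theta_sup:
# imputed term averaged over the unlabeled covariates, plus a
# labeled-data correction that removes the imputation bias.
psi = np.mean(m_unl - tau) + np.mean(Z - m_lab)

# Kernel density estimate of the response density at theta_sup,
# serving as the derivative in the one-step Newton update.
h = 1.06 * Y_lab.std() * n ** (-0.2)
f_hat = np.mean(np.exp(-0.5 * ((Y_lab - theta_sup) / h) ** 2)) / (
    h * np.sqrt(2.0 * np.pi)
)

# One-step semi-supervised update of the sample quantile.
theta_ss = theta_sup - psi / f_hat
```

The correction term is small when the imputation model fits well, so the update mainly reweights information from the large unlabeled covariate sample.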
In addition, as an illustration of estimating the nuisance imputation function, we consider kernel smoothing type estimators based on lower dimensional, possibly estimated, transformations of the high dimensional covariates, and we establish novel results on the uniform convergence rates of such kernel smoothing estimators in high dimensions, involving responses indexed by a function class and the use of dimension reduction techniques. These results may be of independent interest. Numerical results on both simulated and real data confirm the improved performance of our semi-supervised approach, in terms of both estimation and inference.
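The idea of kernel smoothing on an estimated lower dimensional transformation can be illustrated as follows. This is a simplified sketch under assumptions of our own choosing: the transformation is a single least-squares index (the paper allows general estimated transformations), and the smoother is a plain Nadaraya-Watson estimator with a Gaussian kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 50                       # many covariates, moderate sample
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -0.5, 0.5]          # sparse true index direction
Y = np.sin(X @ beta) + 0.1 * rng.normal(size=n)

# Step 1: estimated one-dimensional transformation of the
# high dimensional covariates (here, a least-squares index).
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
S = X @ beta_hat                     # reduced, estimated covariate

# Step 2: Nadaraya-Watson kernel smoother on the estimated index.
def nw(s0, S, Y, h):
    w = np.exp(-0.5 * ((S - s0) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

h = 1.06 * S.std() * n ** (-0.2)     # rule-of-thumb bandwidth
m_hat = np.array([nw(s, S, Y, h) for s in S])
```

Smoothing on the one-dimensional index sidesteps the curse of dimensionality that a p-dimensional kernel estimator would face; the price, analyzed in the paper, is the extra estimation error from the transformation itself.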
In this article, we aim to provide a general and complete understanding of semi-supervised (SS) causal inference for treatment effects, using two such estimands as prototype cases. Specifically, we consider estimation of (a) the average treatment effect and (b) the quantile treatment effect, in an SS setting characterized by two available data sets: (i) a labeled data set of size n, providing observations of a response, a set of potentially high dimensional covariates, and a binary treatment indicator; and (ii) an unlabeled data set of size N, much larger than n, but without the response observed. Using these two data sets, we develop a family of SS estimators which are guaranteed to be (1) more robust and (2) more efficient than their supervised counterparts based on the labeled data set only. Moreover, beyond the "standard" double robustness results (in terms of consistency) that can be achieved by supervised methods as well, we further establish root-n consistency and asymptotic normality of our SS estimators whenever the propensity score in the model is correctly specified, without requiring specific forms of the nuisance functions involved. Such an improvement in robustness arises from the use of the massive unlabeled data, so it is generally not attainable in a purely supervised setting. In addition, our estimators are shown to be semiparametrically efficient as long as all the nuisance functions are correctly specified. Finally, as an illustration of the nuisance function estimation, we consider inverse-probability-weighting type kernel smoothing estimators involving possibly unknown covariate transformation mechanisms, and we establish novel results on their uniform convergence rates in high dimensional scenarios. These results should be of independent interest.
Numerical results on both simulated and real data validate the advantage of our methods over their supervised counterparts with respect to both robustness and efficiency.
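The structure of such an SS average-treatment-effect estimator can be sketched as follows. This is an illustrative toy version under our own assumptions: linear outcome regressions fitted on the labeled data, the true (i.e., correctly specified) propensity score plugged in for brevity, and the outcome-regression term averaged over the combined covariate sample.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 500, 50000

# Labeled data: covariate, treatment, response; unlabeled: covariate only.
X_lab = rng.normal(size=n)
pi_true = 1.0 / (1.0 + np.exp(-X_lab))        # propensity score P(A=1 | X)
A = rng.binomial(1, pi_true)
Y = 1.0 + A + X_lab + rng.normal(size=n)      # true ATE = 1

X_unl = rng.normal(size=N)
X_all = np.concatenate([X_lab, X_unl])

# Nuisance outcome regressions, fitted within each treatment arm.
mu1 = np.polyfit(X_lab[A == 1], Y[A == 1], 1)
mu0 = np.polyfit(X_lab[A == 0], Y[A == 0], 1)
pi_hat = pi_true                               # correctly specified propensity

# SS estimator: outcome-regression contrast averaged over ALL covariates
# (labeled + unlabeled), plus an inverse-probability-weighted residual
# correction computed on the labeled data.
ate_ss = np.mean(np.polyval(mu1, X_all) - np.polyval(mu0, X_all)) + np.mean(
    A * (Y - np.polyval(mu1, X_lab)) / pi_hat
    - (1 - A) * (Y - np.polyval(mu0, X_lab)) / (1 - pi_hat)
)
```

Averaging the regression contrast over the massive unlabeled sample is what distinguishes this from its purely supervised (AIPW-type) counterpart, and is the source of the efficiency and robustness gains described above.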
We consider nonlinear regression models defined solely by a parametric model for the regression function. The responses are assumed to be missing at random, with the missingness depending on multiple covariates. We propose estimators for expectations of a known function of the response and covariates: nonparametric estimators corrected by the fitted regression function. We show that they are asymptotically efficient in the sense of Hájek and Le Cam. Simulations and a real-data example confirm the optimality of our approach.
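One common form of such a regression-corrected estimator can be sketched as below. This is a hedged illustration rather than the authors' exact construction: the target is E[g(Y, X)] with g(y, x) = y, the regression model is fitted on complete cases (valid under missingness at random), and the true propensity of observation is plugged in for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)    # parametric regression model
pi = 1.0 / (1.0 + np.exp(-0.5 * X))       # P(response observed | X): MAR
R = rng.binomial(1, pi)                   # R = 1 iff Y is observed

# Target: E[g(Y, X)] with g(y, x) = y; here the true value is 1.0.
# Fit the parametric regression model on the complete cases
# (under MAR, E[Y | X, R = 1] = E[Y | X], so this fit is valid).
coef = np.polyfit(X[R == 1], Y[R == 1], 1)
m_hat = np.polyval(coef, X)

# Regression-corrected estimator: average the fitted regression over
# all covariates, then correct with weighted complete-case residuals.
theta_hat = np.mean(m_hat) + np.mean(R * (Y - m_hat) / pi)
```

The correction term has mean zero when the regression model is correct, and the combination attains the efficiency bound described in the abstract.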
This article deals with the analysis of high dimensional data that come from multiple sources ("experiments") and thus have different, possibly correlated, responses but share the same set of predictors. The measurements of the predictors may differ across experiments. We introduce a new regression approach based on multiple quantiles that selects the predictors affecting any of the responses at any quantile level and estimates the nonzero parameters. Our estimator minimizes a penalized objective function that aggregates the data from the different experiments. We establish model selection consistency and asymptotic normality of the estimator. In addition, we present an information criterion that can also be used for consistent model selection. Simulations and two data applications illustrate the advantages of our method, which accounts for the group structure induced by the predictors across experiments and quantile levels.
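The shape of such a group-penalized multi-quantile objective can be written down directly. The following sketch is our own simplified rendering (with hypothetical names and a plain group-lasso penalty); it only evaluates the objective, leaving the optimization aside.

```python
import numpy as np

def check_loss(u, tau):
    # Quantile (check) loss: rho_tau(u) = u * (tau - 1{u < 0}).
    return u * (tau - (u < 0))

def objective(B, Xs, Ys, taus, lam):
    """Penalized objective aggregating K experiments and T quantile levels.

    B   : (p, K, T) coefficients for p shared predictors
    Xs  : list of K design matrices, shapes (n_k, p)
    Ys  : list of K response vectors, shapes (n_k,)
    """
    loss = 0.0
    for k, (Xk, yk) in enumerate(zip(Xs, Ys)):
        for t, tau in enumerate(taus):
            loss += np.mean(check_loss(yk - Xk @ B[:, k, t], tau))
    # Group penalty: one group per predictor, spanning all experiments
    # and quantile levels, so each predictor enters or leaves as a whole.
    penalty = lam * np.sum(np.sqrt(np.sum(B ** 2, axis=(1, 2))))
    return loss + penalty

# Tiny example: two experiments sharing p = 4 predictors.
rng = np.random.default_rng(4)
p, taus = 4, (0.25, 0.75)
Xs = [rng.normal(size=(30, p)) for _ in range(2)]
Ys = [rng.normal(size=30) for _ in range(2)]
B0 = np.zeros((p, 2, len(taus)))
```

Grouping each predictor's coefficients across experiments and quantile levels is what lets a single penalty decide whether that predictor affects any response at any level.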
This article considers a linear model in a high dimensional data scenario. We propose a procedure that uses multiple loss functions both to select relevant predictors and to estimate parameters, and we study its asymptotic properties. Variable selection is conducted by a procedure called "vote", which aggregates the results from the penalized loss functions. Using the multiple objective functions separately simplifies the algorithms and allows parallel computing, which is convenient and fast. As a special example, we consider a quantile regression model that optimally combines multiple quantile levels. We show that the resulting estimators of the parameter vector are asymptotically efficient. Simulations and a data application confirm the three main advantages of our approach: (a) a reduced false discovery rate in variable selection; (b) improved parameter estimation; (c) increased computational efficiency.
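The "vote" aggregation step admits a very short sketch. This is a simplified illustration of our own: the penalized fits themselves are taken as given, represented only by their selected supports, and a predictor is kept when a majority of the fits select it.

```python
import numpy as np

def vote_select(supports, threshold=0.5):
    """Aggregate variable selection across multiple penalized fits.

    supports  : (L, p) 0/1 array; row l is the support selected by
                the l-th penalized loss function.
    threshold : a predictor is kept when it appears in at least this
                fraction of the individual fits.
    """
    supports = np.asarray(supports, dtype=float)
    return supports.mean(axis=0) >= threshold

# E.g., three quantile-level fits over five predictors.
S = [[1, 1, 0, 0, 1],
     [1, 0, 0, 0, 1],
     [1, 1, 0, 1, 0]]
selected = vote_select(S)   # predictors 0, 1, and 4 win the vote
```

Because each penalized fit runs independently, the L fits parallelize trivially, which is the computational advantage noted in the abstract.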