High-dimensional semi-supervised learning: in search for optimal inference of the mean

Zhang, Yuqian; Bradić, Jelena

doi:10.48550/arxiv.1902.00772

Cited by 5 publications

(24 citation statements)

References 30 publications

“…This estimator δ is exactly identical to the estimator δ in Definition 1 under MCAR assumption. In Appendix A, we further discuss how this estimator connects to those in Cheng et al [2018], Zhang and Bradic [2019], and how our result generalizes those in previous literature.…”

Section: Assumption 5 (Moment Condition) | supporting

confidence: 77%

“…Condition (3) requires the primary outcome to be missing at random (MAR), i.e., the indicator R depends on only observed variables, including pre-treatment covariates X, the treatment T and surrogates S. This condition guarantees that the distribution of the primary outcome on the labelled data and unlabelled data are comparable after accounting for the observed variables, so that we can use the labelled data to infer information about the missing primary outcome in the unlabelled data. This condition is considerably weaker than the missing completely at random (MCAR) condition typically assumed in previous semi-supervised inference literature [e.g., Cheng et al, 2018, Zhang and Bradic, 2019], since MCAR does not allow the missingness of the primary outcome to depend on any other variable. Condition (3) may be satisfied by design in a two-phase sampling scheme [e.g., Wang et al, 2009, Cochran, 2007]: in the first phase, relatively cheap measurements of T, X, S are available for all units, and in the second phase, expensive measurements of the primary outcome Y are collected for a validation subsample selected according to variables measured in the first phase.…”

Section: Problem Setup | mentioning

confidence: 93%
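The MAR versus MCAR distinction drawn in the statement above can be illustrated with a small simulation. This is a minimal sketch under assumed settings, not code from any of the cited papers: the data-generating model, the logistic labelling probability, and the function names `simulate`, `labelled_mean`, and `ipw_mean` are all hypothetical. It contrasts the plain mean of the labelled outcomes with an inverse-probability-weighted mean that uses the (here known) labelling probability.

```python
import numpy as np

def simulate(n=50000, seed=0):
    """Toy data (assumed model): Y = 1 + 2X + noise, so the true mean of Y is 1."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    # MCAR: every unit is labelled with the same probability, unrelated to X.
    r_mcar = rng.random(n) < 0.3
    # MAR: the labelling probability depends on the observed covariate X only.
    p_mar = 1.0 / (1.0 + np.exp(-x))
    r_mar = rng.random(n) < p_mar
    return x, y, r_mcar, r_mar, p_mar

def labelled_mean(y, r):
    """Plain average of the outcomes that were actually observed."""
    return y[r].mean()

def ipw_mean(y, r, p):
    """Normalised inverse-probability-weighted mean; uses Y only where r is True."""
    y_obs = np.where(r, y, 0.0)
    weights = r / p
    return np.sum(weights * y_obs) / np.sum(weights)

if __name__ == "__main__":
    x, y, r_mcar, r_mar, p_mar = simulate()
    print("MCAR, labelled mean:", labelled_mean(y, r_mcar))
    print("MAR,  labelled mean:", labelled_mean(y, r_mar))
    print("MAR,  IPW mean:     ", ipw_mean(y, r_mar, p_mar))
```

In this toy setup the labelled-only mean is close to the truth under MCAR but drifts upward under MAR, because units with large X are labelled more often; reweighting by the labelling probability removes that drift. In practice the labelling probability would have to be estimated unless it is fixed by design, as in the two-phase sampling scheme mentioned in the statement.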

“…This condition guarantees that the distribution of the subsample with primary-outcome observations (labelled data) is comparable with the distribution of the remaining sample without primary-outcome observations (unlabelled data) after adjusting for the observed variables. Similar missingness conditions are also commonly assumed in previous literature that combine different datasets [e.g., Athey et al, 2019, Cheng et al, 2018, Zhang and Bradic, 2019]. Under only these standard assumptions, and in particular no overly restrictive surrogate conditions, we aim to investigate the role of surrogates in estimating treatment effects when the primary-outcome observations are limited.…”

Section: Introduction | mentioning

confidence: 88%

“…Our paper is related to the growing body of literature on parameter estimation and inference in the semi-supervised setting where a small labelled dataset is enriched with a large unlabelled dataset. A string of research investigates how to use unlabelled data to aid the estimation of a wide variety of parameters, including linear regression coefficients [Azriel et al, 2016, Chakrabortty et al, 2018], population mean and average treatment effect [Zhang and Bradic, 2019], performance measures of a given classification rule like receiver operating characteristic (ROC) curve [Gronsbell and Cai, 2018], etc. These papers typically propose estimators for a finite dimensional target parameter and study their asymptotic performance, which supplement extensive literature on semi-supervised learning of prediction rules [see Zhu and Goldberg, 2009, for a comprehensive review].…”

Section: Related Literature | mentioning

confidence: 99%

“…[2018], Zhang and Bradic [2019]. In this special setting, R is independent with all other variables, so λ*(X) = f*(X)/f*(X | R = 1) = 1 does not need to be estimated, and the estimator δ in Definition 2 reduces to…”

Section: Assumption 5 (Moment Condition) | mentioning

confidence: 99%

On the role of surrogates in the efficient estimation of treatment effects with limited outcome data

Kallus, Mao

2020

Preprint

We study the problem of estimating treatment effects when the outcome of primary interest (e.g., long-term health status) is only seldom observed but abundant surrogate observations (e.g., short-term health outcomes) are available. To investigate the role of surrogates in this setting, we derive the semiparametric efficiency lower bounds of average treatment effect (ATE) both with and without presence of surrogates, as well as several intermediary settings. These bounds characterize the best-possible precision of ATE estimation in each case, and their difference quantifies the efficiency gains from optimally leveraging the surrogates in terms of key problem characteristics when only limited outcome data are available. We show these results apply in two important regimes: when the number of surrogate observations is comparable to primary-outcome observations and when the former dominates the latter. Importantly, we take a missing-data approach that circumvents strong surrogate conditions which are commonly assumed in previous literature but almost always fail in practice. To show how to leverage the efficiency gains of surrogate observations, we propose ATE estimators and inferential methods based on flexible machine learning methods to estimate nuisance parameters that appear in the influence functions. We show our estimators enjoy efficiency and robustness guarantees under weak conditions.

Semi-Supervised Statistical Inference for High-Dimensional Linear Regression with Blockwise Missing Data

Xue, Ma, Li

2021

Preprint

Blockwise missing data occurs frequently when we integrate multisource or multimodality data where different sources or modalities contain complementary information. In this paper, we consider a high-dimensional linear regression model with blockwise missing covariates and a partially observed response variable. Under this semi-supervised framework, we propose a computationally efficient estimator for the regression coefficient vector based on carefully constructed unbiased estimating equations and a multiple blockwise imputation procedure, and obtain its rates of convergence. Furthermore, building upon an innovative semi-supervised projected estimating equation technique that intrinsically achieves bias correction of the initial estimator, we propose nearly unbiased estimators for the individual regression coefficients that are asymptotically normally distributed under mild conditions. By carefully analyzing these debiased estimators, asymptotically valid confidence intervals and statistical tests about each regression coefficient are constructed. Numerical studies and application analysis of the Alzheimer's Disease Neuroimaging Initiative data show that the proposed method performs better and benefits more from unsupervised samples than existing methods.

A General Framework for Treatment Effect Estimation in Semi-Supervised and High Dimensional Settings

Chakrabortty, Dai, Tchetgen

2022

Preprint

In this article, we aim to provide a general and complete understanding of semi-supervised (SS) causal inference for treatment effects, using two such estimands as prototype cases. Specifically, we consider estimation of: (a) the average treatment effect and (b) the quantile treatment effect, in an SS setting, which is characterized by two available data sets: (i) a labeled data set of size n, providing observations for a response and a set of potentially high dimensional covariates, as well as a binary treatment indicator; and (ii) an unlabeled data set of size N, much larger than n, but without the response observed. Using these two data sets, we develop a family of SS estimators which are guaranteed to be: (1) more robust and (2) more efficient, than their supervised counterparts based on the labeled data set only. Moreover, beyond the "standard" double robustness results (in terms of consistency) that can be achieved by supervised methods as well, we further establish root-n consistency and asymptotic normality of our SS estimators whenever the propensity score in the model is correctly specified, without requiring specific forms of the nuisance functions involved. Such an improvement in robustness arises from the use of the massive unlabeled data, so it is generally not attainable in a purely supervised setting. In addition, our estimators are shown to be semiparametrically efficient also as long as all the nuisance functions are correctly specified. Moreover, as an illustration of the nuisance function estimation, we consider inverse-probability-weighting type kernel smoothing estimators involving possibly unknown covariate transformation mechanisms, and establish in high dimensional scenarios novel results on their uniform convergence rates. These results should be of independent interest. Numerical results on both simulated and real data validate the advantage of our methods over their supervised counterparts with respect to both robustness and efficiency.
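The two-data-set setting described in this abstract (a labeled sample of size n and a much larger unlabeled sample of size N) can be made concrete with a simplified example. The sketch below is an assumption-laden illustration, not the estimators of the abstract: it targets a plain population mean rather than a treatment effect, uses ordinary least squares as the working regression, and the function name `ss_mean` is hypothetical. It averages fitted values over all covariates and adds back the mean labeled residual.

```python
import numpy as np

def ss_mean(x_lab, y_lab, x_unlab):
    """Regression-adjusted semi-supervised estimate of E[Y] (illustrative only).

    x_lab:   (n, p) covariates of the labeled sample
    y_lab:   (n,)   responses of the labeled sample
    x_unlab: (N, p) covariates of the unlabeled sample
    """
    # Fit a working linear model with intercept on the labeled data only.
    design_lab = np.column_stack([np.ones(len(x_lab)), x_lab])
    beta, *_ = np.linalg.lstsq(design_lab, y_lab, rcond=None)

    # Average the fitted values over labeled and unlabeled covariates together.
    x_all = np.vstack([x_lab, x_unlab])
    design_all = np.column_stack([np.ones(len(x_all)), x_all])
    fitted_mean = (design_all @ beta).mean()

    # Add the mean labeled residual (zero for OLS with intercept, kept so the
    # same template applies when another working model is swapped in).
    resid_mean = (y_lab - design_lab @ beta).mean()
    return fitted_mean + resid_mean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    coef = np.array([1.0, -2.0, 0.5])      # assumed toy coefficients
    x_lab = rng.normal(size=(100, 3))
    x_unlab = rng.normal(size=(5000, 3))
    y_lab = 1.0 + x_lab @ coef + rng.normal(size=100)
    print("supervised mean:     ", y_lab.mean())
    print("semi-supervised mean:", ss_mean(x_lab, y_lab, x_unlab))
```

When the covariates explain a good share of the response, as in this toy model, the part of the variance due to the covariates is averaged over n + N points instead of n, which is the intuition behind efficiency gains from unlabeled data.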
