2019
DOI: 10.48550/arxiv.1902.00772
Preprint

High-dimensional semi-supervised learning: in search for optimal inference of the mean

Abstract: We provide a high-dimensional semi-supervised inference framework focused on the mean and variance of the response. Our data are comprised of an extensive set of observations regarding the covariate vectors and a much smaller set of labeled observations where we observe both the response as well as the covariates. We allow the size of the covariates to be much larger than the sample size and impose weak conditions on a statistical form of the data. We provide new estimators of the mean and variance of the resp…

Cited by 5 publications (24 citation statements)
References 30 publications
“…This estimator δ is exactly identical to the estimator δ in Definition 1 under MCAR assumption. In Appendix A, we further discuss how this estimator connects to those in Cheng et al [2018], Zhang and Bradic [2019], and how our result generalizes those in previous literature.…”
Section: Assumption 5 (Moment Condition)
Citation type: supporting
confidence: 77%
“…Condition (3) requires the primary outcome to be missing at random (MAR), i.e., the indicator R depends on only observed variables, including pre-treatment covariates X, the treatment T and surrogates S. This condition guarantees that the distribution of the primary outcome on the labelled data and unlabelled data are comparable after accounting for the observed variables, so that we can use the labelled data to infer information about the missing primary outcome in the unlabelled data. This condition is considerably weaker than the missing completely at random (MCAR) condition typically assumed in previous semi-supervised inference literature [e.g., Cheng et al, 2018, Zhang and Bradic, 2019], since MCAR does not allow the missingness of the primary outcome to depend on any other variable. Condition (3) may be satisfied by design in a two-phase sampling scheme [e.g., Wang et al, 2009, Cochran, 2007]: in the first phase, relatively cheap measurements of T, X, S are available for all units, and in the second phase, expensive measurements of the primary outcome Y are collected for a validation subsample selected according to variables measured in the first phase.…”
Section: Problem Setup
Citation type: mentioning
confidence: 93%