2019
DOI: 10.48550/arxiv.1906.12125
Preprint

High-dimensional principal component analysis with heterogeneous missingness

Abstract: We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In simple, homogeneous missingness settings with a noise level of constant order, we show that an existing inverse-probability weighted (IPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence. However, deeper investigation reveals both that, particularly in more realistic settings where the missingness mechanism is heterogeneous, the empirical performanc…
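The IPW weighting idea described in the abstract can be sketched in a few lines. Everything below (the function name `ipw_pca`, the homogeneous-missingness assumption, the plug-in estimate of the observation probability) is illustrative only and is not taken from the paper:

```python
import numpy as np

def ipw_pca(Y, k):
    """Minimal sketch of an inverse-probability weighted (IPW) PCA
    estimator, assuming homogeneous missingness: every entry of Y is
    observed independently with the same (unknown) probability p.
    Y is (n, d) with np.nan marking missing entries; returns the k
    leading eigenvectors of a debiased second-moment estimate.
    """
    n, d = Y.shape
    obs = ~np.isnan(Y)                 # observation mask
    p_hat = obs.mean()                 # plug-in estimate of p
    Z = np.where(obs, Y, 0.0)          # zero-fill missing entries
    G = Z.T @ Z / n                    # second-moment matrix of zero-filled data
    # Debias by inverse probabilities: an off-diagonal entry survives
    # zero-filling with probability p^2 (two observations needed), a
    # diagonal entry with probability p (one observation).
    S = G / p_hat ** 2
    np.fill_diagonal(S, np.diag(G) / p_hat)
    vals, vecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]        # k leading eigenvectors
```

For heterogeneous missingness, the abstract's point is precisely that a single pooled probability estimate is inadequate; entry-wise observation probabilities would be needed in the reweighting step.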

Cited by 22 publications (25 citation statements)
References 42 publications
“…This is a generalization of the MAR setting as such an approach circumvents the requirement of meaningful auxiliary features X to conduct propensity score estimation. Additional works within the MNAR literature include Zhu et al (2019); Sportisse et al (2020a,b); Wang et al (2020).…”
Section: Missing Not At Random (MNAR): MNAR is the most challenging mi…
confidence: 99%
“…That is, the entries are missing not at random (MNAR). To address the above challenges, there has been exciting recent progress on matrix completion with MNAR data, including Schnabel et al (2016); Ma and Chen (2019); Zhu et al (2019); Sportisse et al (2020a,b); Wang et al (2020); Yang et al (2021); Bhattacharya and Chatterjee (2021). Through numerous empirical studies, these works have shown that algorithms that account for MNAR data outperform conventional algorithms that are designed for MCAR data.…”
Section: Introduction
confidence: 99%
“…Inadequacy of prior works. While methods for estimating principal subspace are certainly not in shortage (e.g., Balzano et al (2018); Cai et al (2021); Cai and Zhang (2018); Li et al (2021); Lounici (2014); Zhang et al (2018); Zhu et al (2019)), methods for constructing confidence regions for principal subspace remain vastly under-explored. The fact that the estimators in use for PCA are typically nonlinear and nonconvex presents a substantial challenge in the development of a distributional theory, let alone uncertainty quantification.…”
Section: Problem Formulation
confidence: 99%
“…Several useful extensions have been developed tailored to high-dimensional statistical applications, particularly when the perturbation matrix of interest enjoys certain random structure [O'Rourke et al., 2018, Vu, 2011, Wang, 2015, Xia, 2019, Yu et al., 2015]. In particular, the ℓ2 perturbation bounds for the eigenvector (or eigenspace) of the sample covariance matrix have been extensively studied in the PCA literature, e.g., [Johnstone and Lu, 2009, Lounici, 2013, 2014, Nadler, 2008, Zhu et al., 2019]. Another line of works [O'Rourke et al., 2018, Vu, 2011] improved Davis-Kahan's and Wedin's theorems in the matrix denoising setting with small eigengaps, which, however, is not tight unless the spectral norm of the noise matrix H is extremely small.…”
Section: Related Work
confidence: 99%
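The perturbation bounds discussed in the statement above can be illustrated numerically. The sketch below demonstrates the standard Davis-Kahan sin-theta inequality (in the Yu-Wang-Samworth form); it is a generic textbook example, not code from any of the cited works:

```python
import numpy as np

# Davis-Kahan-type bound: for a symmetric matrix A with top eigengap
# delta = lambda_1 - lambda_2 and a symmetric perturbation E,
#     sin angle(v1(A), v1(A + E)) <= 2 * ||E||_op / delta.
rng = np.random.default_rng(1)
d = 50
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]       # random orthogonal basis
A = Q @ np.diag([10.0, 4.0] + [1.0] * (d - 2)) @ Q.T   # eigengap delta = 6
E = rng.standard_normal((d, d))
E = 0.05 * (E + E.T)                                   # small symmetric noise

def top_eigvec(M):
    """Unit eigenvector of the largest eigenvalue of symmetric M."""
    _, V = np.linalg.eigh(M)   # eigh returns eigenvalues in ascending order
    return V[:, -1]

v, v_pert = top_eigvec(A), top_eigvec(A + E)
sin_theta = np.sqrt(max(0.0, 1.0 - float(v @ v_pert) ** 2))
bound = 2.0 * np.linalg.norm(E, 2) / 6.0               # 2 ||E||_op / delta
```

As the quoted passage notes, bounds of this form degrade when the eigengap is small relative to the noise level, which motivates the sharper analyses cited there.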