In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome. Supervised principal components can be applied to regression and generalized regression problems, such as survival analysis. It compares favorably to other techniques for this type of problem, and can also account for the effects of other covariates and help identify which predictor variables are most important. We also provide asymptotic consistency results to help support our empirical findings. These methods could become important tools for DNA microarray data, where they may be used to more accurately diagnose and treat cancer.
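The procedure described in this abstract (screen predictors by their association with the outcome, then run principal components on the retained subset) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `supervised_pc`, the use of absolute correlation as the association measure, and the fixed `threshold` argument are all assumptions made for the example. In practice the threshold would usually be chosen by cross-validation.

```python
import numpy as np

def supervised_pc(X, y, threshold, n_components=1):
    """Hypothetical sketch of supervised principal components.

    Assumption: association is measured by absolute Pearson correlation
    between each predictor and the outcome.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Center predictors and outcome.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    # Univariate association of each predictor with the outcome.
    col_norms = np.linalg.norm(Xc, axis=0)
    col_norms[col_norms == 0] = np.inf  # constant columns get score 0
    corr = (Xc.T @ yc) / (col_norms * np.linalg.norm(yc))

    # Keep only predictors whose association exceeds the threshold.
    keep = np.flatnonzero(np.abs(corr) > threshold)
    if keep.size == 0:
        raise ValueError("no predictor passes the threshold")

    # Principal components of the reduced, centered data matrix.
    U, s, _ = np.linalg.svd(Xc[:, keep], full_matrices=False)
    Z = U[:, :n_components] * s[:n_components]

    # Ordinary least squares of the outcome on the leading components.
    design = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return keep, Z, coef
```

The returned `keep` indices show which predictors entered the components, `Z` holds the component scores, and `coef` holds the intercept followed by the regression coefficients on those scores.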
We study the problem of estimating the leading eigenvectors of a high-dimensional population covariance matrix based on independent Gaussian observations. We establish a lower bound on the minimax risk of estimators under the l2 loss, in the joint limit as dimension and sample size increase to infinity, under various models of sparsity for the population eigenvectors. The lower bound on the risk points to the existence of different regimes of sparsity of the eigenvectors. We also propose a new method for estimating the eigenvectors by a two-stage coordinate selection scheme.
In this paper, we consider the problem of estimating the eigenvalues and eigenfunctions of the covariance kernel (i.e., the functional principal components) from sparse and irregularly observed longitudinal data. We approach this problem through a maximum likelihood method assuming that the covariance kernel is smooth and finite dimensional. We exploit the smoothness of the eigenfunctions to reduce dimensionality by restricting them to a lower dimensional space of smooth functions. The estimation scheme is developed based on a Newton-Raphson procedure using the fact that the basis coefficients representing the eigenfunctions lie on a Stiefel manifold. We also address the selection of the right number of basis functions, as well as that of the dimension of the covariance kernel by a second order approximation to the leave-one-curve-out cross-validation score that is computationally very efficient. The effectiveness of our procedure is demonstrated by simulation studies and an application to a CD4 counts data set. In the simulation studies, our method performs well on both estimation and model selection. It also outperforms two existing approaches: one based on a local polynomial smoothing of the empirical covariances, and another using an EM algorithm.
We give an overview of random matrix theory (RMT) with the objective of highlighting the results and concepts that have a growing impact in the formulation and inference of statistical models and methodologies. This paper focuses on a number of application areas especially within the field of high-dimensional statistics and describes how the development of the theory and practice in high-dimensional statistical inference has been influenced by the corresponding developments in the field of RMT.
Recent proteomic studies have identified proteins related to specific
phenotypes. In addition to marginal association analysis for individual
proteins, analyzing pathways (functionally related sets of proteins) may yield
additional valuable insights. Identifying pathways that differ between
phenotypes can be conceptualized as a multivariate hypothesis testing problem:
whether the mean vector μ of a p-dimensional random vector X equals
μ₀. Proteins within the same biological
pathway may correlate with one another in a complicated way, and type I error
rates can be inflated if such correlations are incorrectly assumed to be absent.
The inflation tends to be more pronounced when the sample size is very small or
there is a large amount of missingness in the data, as is frequently the case in
proteomic discovery studies. To tackle these challenges, we propose a
regularized Hotelling’s T² statistic together with a non-parametric
testing procedure, which effectively controls the type I error rate and
maintains good power in the presence of complex correlation structures and
missing data patterns. We investigate asymptotic properties of the
statistic under pertinent assumptions and compare
the test performance with four existing methods through simulation examples. We
apply the test to a hormone therapy proteomics data
set, and identify several interesting biological pathways for which blood serum
concentrations changed following hormone therapy initiation.
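To make the idea of a regularized Hotelling's T² statistic with a non-parametric test concrete, here is a minimal sketch. The abstract does not specify the form of the regularization or of the testing procedure, so both choices below are assumptions: a ridge-type adjustment (adding `lam` times the identity to the sample covariance) and a sign-flip randomization test, which is valid when the data are symmetric about μ₀ under the null. The names `regularized_t2` and `sign_flip_pvalue` are hypothetical, and missing data are not handled.

```python
import numpy as np

def regularized_t2(X, mu0, lam):
    """Ridge-regularized one-sample T^2 (assumed form, not the paper's)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False).reshape(p, p)
    # S + lam * I is invertible for lam > 0 even when p > n.
    return float(n * d @ np.linalg.solve(S + lam * np.eye(p), d))

def sign_flip_pvalue(X, mu0, lam, n_resamples=999, seed=0):
    """Sign-flip randomization p-value (assumes symmetry about mu0)."""
    rng = np.random.default_rng(seed)
    D = np.asarray(X, dtype=float) - mu0
    n, p = D.shape
    zero = np.zeros(p)
    observed = regularized_t2(D, zero, lam)
    exceed = 0
    for _ in range(n_resamples):
        # Randomly flip the sign of each observation's deviation from mu0.
        signs = rng.choice([-1.0, 1.0], size=(n, 1))
        if regularized_t2(signs * D, zero, lam) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (1 + exceed) / (1 + n_resamples)
```

Because the reference distribution is built by resampling, no assumption about the correlation structure among the p variables is needed, which is the motivation the abstract gives for a non-parametric procedure.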