We consider the problem of selecting grouped variables (factors) for accurate prediction in regression. Such a problem arises naturally in many practical situations, with the multifactor analysis-of-variance problem as the most important and well-known example. The lasso, the LARS algorithm and the non-negative garrotte are recently proposed regression methods that can be used to select individual variables. Instead of selecting factors by stepwise backward elimination, we focus on the accuracy of estimation and consider extensions of these three methods to factor selection. We propose efficient algorithms for the extensions and show that they give superior performance to the traditional stepwise backward elimination method in factor selection problems. We also study the similarities and the differences between the methods. Simulations and real examples are used to illustrate them. Copyright 2006 Royal Statistical Society.
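As a hedged illustration (not the authors' algorithm), the computational core of a group-lasso style extension is block soft-thresholding: the proximal operator of the grouped penalty, which shrinks each factor's coefficient block as a unit and drops a factor entirely when its norm is small. The function name and grouping encoding below are my own.

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Proximal operator of the group-lasso penalty lam * sum_g ||beta_g||_2.
    Each factor's coefficient block is shrunk toward zero as a unit and
    set exactly to zero when its Euclidean norm falls below lam."""
    groups = np.asarray(groups)
    out = np.zeros_like(beta, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(beta[idx])
        if norm > lam:
            out[idx] = (1.0 - lam / norm) * beta[idx]
    return out
```

Because whole blocks are zeroed at once, iterating this operator inside a proximal-gradient loop selects factors rather than individual variables.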
We propose penalized likelihood methods for estimating the concentration matrix in the Gaussian graphical model. The methods lead to sparse shrinkage estimators of the concentration matrix that are positive definite, and thus conduct model selection and estimation simultaneously. The implementation of the methods is nontrivial because of the positive definite constraint on the concentration matrix, but we show that the computation can be done efficiently by taking advantage of the maxdet algorithm developed in convex optimization. We propose a BIC-type criterion for the selection of the tuning parameter in the penalized likelihood methods. The connection between our methods and existing methods is illustrated. Simulations and real examples demonstrate the competitive performance of the new methods.
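For a rough sense of what an l1-penalized concentration-matrix estimate looks like in practice, scikit-learn's `GraphicalLasso` fits a closely related penalized Gaussian likelihood (the graphical lasso of Friedman et al., not the maxdet-based procedure of this paper); the data and tuning value below are arbitrary.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Simulated data; alpha is the l1 penalty tuning parameter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))

# precision_ is the estimated concentration (inverse covariance) matrix;
# the l1 penalty shrinks small partial correlations exactly to zero.
Theta = GraphicalLasso(alpha=0.2).fit(X).precision_
```

The key property shared with the paper's estimators is that the output is simultaneously sparse and positive definite, so the zero pattern can be read directly as an estimated graph.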
We propose a semiparametric approach called the nonparanormal skeptic for efficiently and robustly estimating high dimensional undirected graphical models. To achieve modeling flexibility, we consider the nonparanormal graphical models proposed by Liu, Lafferty and Wasserman (2009). To achieve estimation robustness, we exploit nonparametric rank-based correlation coefficient estimators, including Spearman's rho and Kendall's tau. We prove that the nonparanormal skeptic achieves the optimal parametric rates of convergence for both graph recovery and parameter estimation. This result suggests that the nonparanormal graphical models can be used as a safe replacement for the popular Gaussian graphical models, even when the data are truly Gaussian. Besides theoretical analysis, we also conduct thorough numerical simulations to compare the graph recovery performance of different estimators under both ideal and noisy settings. The proposed methods are then applied to a large-scale genomic dataset to illustrate their empirical usefulness. The R package huge implementing the proposed methods is available on the Comprehensive R Archive Network: http://cran.r-project.org/.
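The Kendall's-tau version of the rank-based correlation step can be sketched directly: each pairwise entry is estimated as sin((pi/2) * tau), which is invariant to monotone transformations of the marginals. This is a minimal sketch of that transform only (not of graph estimation); the function name is my own.

```python
import numpy as np
from scipy.stats import kendalltau

def skeptic_correlation(X):
    """Rank-based correlation matrix estimate S with
    S[j, k] = sin((pi / 2) * tau_jk), tau_jk being Kendall's tau.
    Ranks are unchanged by monotone marginal transforms, so S is too."""
    n, d = X.shape
    S = np.eye(d)
    for j in range(d):
        for k in range(j + 1, d):
            tau, _ = kendalltau(X[:, j], X[:, k])
            S[j, k] = S[k, j] = np.sin(np.pi / 2.0 * tau)
    return S
```

Plugging such a matrix into any Gaussian graph estimator in place of the sample correlation is the substitution the abstract describes.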
Coefficient estimation and variable selection in multiple linear regression is routinely done in the (penalized) least squares (LS) framework. The concept of model selection oracle introduced by Fan and Li [J. Amer. Statist. Assoc. 96 (2001) 1348-1360] characterizes the optimal behavior of a model selection procedure. However, the least-squares oracle theory breaks down if the error variance is infinite. In the current paper we propose a new regression method called composite quantile regression (CQR). We show that the oracle model selection theory using the CQR oracle works beautifully even when the error variance is infinite. We develop a new oracular procedure to achieve the optimal properties of the CQR oracle. When the error variance is finite, CQR still enjoys great advantages in terms of estimation efficiency. We show that the relative efficiency of CQR compared to the least squares is greater than 70% regardless of the error distribution. Moreover, CQR could be much more efficient and sometimes arbitrarily more efficient than the least squares. The same conclusions hold when comparing a CQR-oracular estimator with a LS-oracular estimator.
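A toy sketch of the CQR objective, under my own simplifications (one covariate, a generic optimizer rather than the linear-programming solvers typically used): a single slope is shared across several quantile levels, each level keeping its own intercept, and the sum of check losses is minimized.

```python
import numpy as np
from scipy.optimize import minimize

def cqr_fit(x, y, taus=(0.25, 0.5, 0.75)):
    """Composite quantile regression sketch: minimize the sum over
    quantile levels tau_k of the check loss, with level-specific
    intercepts b_k and one common slope beta."""
    K = len(taus)

    def check(u, tau):  # quantile (check) loss rho_tau(u)
        return np.where(u >= 0, tau * u, (tau - 1.0) * u)

    def objective(theta):  # theta = (b_1, ..., b_K, beta)
        b, beta = theta[:K], theta[K]
        return sum(check(y - b[k] - beta * x, taus[k]).sum() for k in range(K))

    res = minimize(objective, np.zeros(K + 1), method="Powell")
    return res.x[K]  # the common slope estimate
```

Because the loss depends on quantiles rather than squared errors, the fit remains sensible under heavy-tailed noise where least squares degrades.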
Sparse discriminant methods based on independence rules, such as the nearest shrunken centroids classifier (Tibshirani et al., 2002) and features annealed independence rules (Fan & Fan, 2008), have been proposed as computationally attractive tools for feature selection and classification with high-dimensional data. A fundamental drawback of these rules is that they ignore correlations among features and thus could produce misleading feature selections and inferior classifications. We propose a new recipe for sparse discriminant analysis, motivated by the least squares formulation of linear discriminant analysis. To demonstrate our proposal, we study the numerical and theoretical properties of discriminant analysis constructed via Lasso/SCAD penalized least squares. Our theory shows that both proposed methods can consistently identify the subset of discriminative features contributing to the Bayes rule and at the same time consistently estimate the Bayes classification direction, even when the dimension grows faster than any polynomial order of the sample size. The theory allows for general dependence among features. Simulated and real data examples show that our methods compare favorably with other popular sparse discriminant proposals in the literature.
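A minimal sketch of the least-squares connection, using scikit-learn's `Lasso` as a stand-in for the paper's Lasso/SCAD solvers; the class coding of the response below is one standard choice, and the simulated design (only the first feature discriminative) is mine.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 50
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, p))
X[:, 0] += 2.0 * y  # only feature 0 separates the two classes

# Code the two classes numerically, then solve a lasso-penalized least
# squares problem; the nonzero coefficients give the selected features
# and (up to scaling) the estimated classification direction.
resp = np.where(y == 1, n / y.sum(), -n / (n - y.sum()))
beta = Lasso(alpha=0.2).fit(X, resp).coef_
```

Unlike independence rules, the penalized regression accounts for correlations among features when forming the direction, which is the point of the least-squares formulation.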
We study in this paper a smoothness regularization method for functional linear regression and provide a unified treatment for both the prediction and estimation problems. By developing a tool on simultaneous diagonalization of two positive definite kernels, we obtain sharper results on the minimax rates of convergence and show that smoothness regularized estimators achieve the optimal rates of convergence for both prediction and estimation under conditions weaker than those for the functional principal components based methods developed in the literature. Despite the generality of the method of regularization, we show that the procedure is easily implementable. Numerical results are obtained to illustrate the merits of the method and to demonstrate the theoretical developments.
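On a grid, a smoothness-regularized estimator of this kind reduces to penalized least squares with a roughness penalty and has a closed form; the discretization, second-difference penalty, and tuning value below are my own illustrative choices, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, lam = 50, 100, 0.1
t = np.linspace(0.0, 1.0, m)
beta_true = np.sin(np.pi * t)  # true coefficient function on the grid

X = rng.standard_normal((n, m))  # discretized predictor curves X_i(t)
A = X / m                        # Riemann sum approximating ∫ X_i(t) beta(t) dt
y = A @ beta_true + 0.01 * rng.standard_normal(n)

# Second-difference matrix: penalizing ||D beta||^2 discourages rough fits.
D = np.diff(np.eye(m), n=2, axis=0)

# Closed-form smoothness-regularized estimate from the normal equations.
b_hat = np.linalg.solve(A.T @ A + lam * (D.T @ D), A.T @ y)
```

The single solve is what "easily implementable" amounts to here: no eigen-decomposition of the covariance operator is required, in contrast to principal-components-based methods.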
Summary. Coactivator-associated arginine methyltransferase 1 (CARM1), a coactivator for various cancer-relevant transcription factors, is overexpressed in breast cancer. To elucidate the functions of CARM1 in tumorigenesis, we knocked out CARM1 from several breast cancer cell lines using Zinc-Finger Nuclease technology, which resulted in drastic phenotypic and biochemical changes. The CARM1 KO cell lines enabled identification of CARM1 substrates, notably the SWI/SNF core subunit BAF155. Methylation of BAF155 at R1064 was found to be an independent prognostic biomarker for cancer recurrence and to regulate breast cancer cell migration and metastasis. Furthermore, CARM1-mediated BAF155 methylation affects gene expression by directing methylated BAF155 to unique chromatin regions (e.g., c-Myc pathway genes). Collectively, our studies uncover a mechanism by which BAF155 acquires tumorigenic functions via arginine methylation.
Summary. We introduce a general formulation for dimension reduction and coefficient estimation in the multivariate linear model. We argue that many of the existing methods that are commonly used in practice can be formulated in this framework and have various restrictions. We then propose a new method that is more flexible and more generally applicable. The method proposed can be formulated as a novel penalized least squares estimate. The penalty that we employ is the Ky Fan norm of the coefficient matrix. Such a penalty encourages sparsity among the singular values and at the same time gives shrinkage coefficient estimates, thus conducting dimension reduction and coefficient estimation simultaneously in the multivariate linear model. We also propose a generalized cross-validation type of criterion for the selection of the tuning parameter in the penalized least squares. Simulations and an application in financial econometrics demonstrate competitive performance of the new method. An extension to the non-parametric factor model is also discussed.
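When the Ky Fan norm is taken at full order it coincides with the nuclear norm (the sum of singular values), and the proximal operator of that penalty is singular-value soft-thresholding: shrink every singular value and drop those below the threshold, which reduces the rank. A hedged numpy sketch (the function name is mine):

```python
import numpy as np

def svt(B, lam):
    """Singular-value soft-thresholding: the proximal operator of the
    nuclear-norm penalty lam * ||B||_*. Singular values are shrunk by
    lam and those at or below lam are set to zero, lowering the rank."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt
```

Applied to the coefficient matrix inside an iterative least-squares solver, this operator produces exactly the simultaneous shrinkage and rank (dimension) reduction the abstract describes.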