The Lasso (Tibshirani, 1996) is an attractive technique for regularization and variable selection for high-dimensional data, where the number of predictor variables p is potentially much larger than the number of samples n. However, it was recently discovered (Zhao and Yu, 2006; Zou, 2005; Meinshausen and Bühlmann, 2006) that the sparsity pattern of the Lasso estimator can only be asymptotically identical to the true sparsity pattern if the design matrix satisfies the so-called irrepresentable condition. The latter condition can easily be violated in applications due to the presence of highly correlated variables.

Here we examine the behavior of the Lasso estimators if the irrepresentable condition is relaxed. Even though the Lasso cannot recover the correct sparsity pattern, we show that the estimator is still consistent in the ℓ_2-norm sense for fixed designs under conditions on (a) the number s_n of non-zero components of the vector β_n and (b) the minimal singular values of the design matrices that are induced by selecting on the order of s_n variables. The results are extended to vectors β in weak ℓ_q-balls with 0 < q < 1. Our results imply that, with high probability, all important variables are selected. The set of selected variables is a useful (meaningful) reduction of the original set of variables (p_n > n). Finally, our results are illustrated with the detection of closely adjacent frequencies, a problem encountered in astrophysics.

Acknowledgments: We would like to thank Noureddine El Karoui and Debashis Paul for pointing out interesting connections to random matrix theory. Some results of this manuscript have been presented at the Oberwolfach workshop "Qualitative Assumptions and Regularization for High-Dimensional Data". Nicolai Meinshausen is supported by DFG (Deutsche Forschungsgemeinschaft) and Bin Yu is partially supported by a Guggenheim fellowship and grants NSF DMS-0605165 (06-08), NSF DMS-03036508 (03-05) and ARO W911NF-05-1-0104 (05-07).
Part of this work has been presented at Oberwolfach workshop 0645, "Qualitative Assumptions and Regularization in High-Dimensional Statistics".
Report date: December 2006