Partial least squares, as a dimension reduction technique, has become increasingly important for its ability to handle problems with a large number of variables. Since noisy variables may weaken estimation performance, the sparse partial least squares (SPLS) technique has been proposed to identify important variables and generate more interpretable results. However, the small sample size of a single dataset limits the performance of conventional methods. An effective solution is to gather information from multiple comparable studies. Integrative analysis plays an essential role in the analysis of multiple datasets. The main idea is to improve performance by assembling raw data from multiple independent datasets and analyzing them jointly. In this article, we develop an integrative SPLS (iSPLS) method using penalization based on the SPLS technique. The proposed approach employs two penalties. The first penalty conducts variable selection in the context of integrative analysis. The second, a contrasted penalty, is imposed to encourage similarity of estimates across datasets and generate more sensible and accurate results. Computational algorithms are developed. Simulation experiments are conducted to compare iSPLS with alternative approaches. The practical utility of iSPLS is shown in the analysis of two TCGA gene expression datasets.
KEYWORDS: contrasted penalization, integrative analysis, partial least squares
INTRODUCTION

Data with high-dimensional variables are becoming routine. With such data, partial least squares (PLS), initially developed by Wold et al, 1 has been successfully used as a dimension reduction method in many areas such as chemometrics 2 and genetics. 3 PLS reduces variable dimension by constructing new components, which are linear combinations of the original variables. It possesses much-desired properties such as stability under collinearity and high dimensionality, giving it a clear advantage over many other methods. In high-dimensional analysis, noise accumulation from irrelevant variables has long been recognized. 4 For example, in omics studies, it is widely accepted that only a small fraction of genes are associated with outcomes. To yield more accurate estimation and facilitate interpretation, variable selection needs to be considered. Chun and Keleş 5 proposed the sparse PLS (SPLS) technique, which conducts variable selection and dimension reduction simultaneously by imposing an elastic net penalty within the PLS optimization.
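To make the idea concrete, the sketch below estimates the first PLS direction (the covariance between each variable and the outcome) and then soft-thresholds it to zero out noise variables. This is a minimal illustration of the sparsity idea, not the exact elastic net formulation of Chun and Keleş; the function name, the thresholding rule, and the toy data are assumptions for illustration only.

```python
import numpy as np

def sparse_pls_direction(X, y, lam=0.2):
    """Sparse estimate of the first PLS direction (illustrative sketch).

    The dense direction maximizing Cov(Xw, y) is proportional to X'y;
    soft-thresholding that covariance vector (a simplified stand-in for
    the elastic net step of SPLS) drops weakly associated variables.
    """
    # covariance-scale association between each column of X and y
    c = X.T @ y
    # soft-threshold relative to the largest entry: shrink and zero out
    thresh = lam * np.max(np.abs(c))
    w = np.sign(c) * np.maximum(np.abs(c) - thresh, 0.0)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

# toy data: only the first 2 of 10 variables drive the outcome
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(100)

w = sparse_pls_direction(X, y, lam=0.2)
print(np.nonzero(w)[0])  # indices of the variables retained in the direction
```

The resulting sparse direction keeps the truly associated variables while most noise variables receive exactly zero weight, which is what makes the fitted components interpretable.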