This article describes the applicability of multivariate projection techniques, such as principal-component analysis (PCA) and partial least-squares (PLS) projections to latent structures, to the large-volume high-density data structures obtained within genomics, proteomics, and metabonomics. PCA and PLS, and their extensions, derive their usefulness from their ability to analyze data with many, noisy, collinear, and even incomplete variables in both X and Y. Three examples are used as illustrations: the first example is a genomics data set and involves modeling of microarray data of cell cycle-regulated genes in the microorganism Saccharomyces cerevisiae. The second example contains NMR-metabonomics data, measured on urine samples of male rats treated with either of the drugs chloroquine or amiodarone. The third and last data set describes sequence-function classification studies in a set of G-protein-coupled receptors using hierarchical PCA.
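One reason PCA tolerates incomplete data, as claimed above, is that the NIPALS algorithm estimates each component from the observed cells only. The following is a minimal NumPy sketch on simulated data (not taken from the article; matrix sizes, noise level, and missing fraction are arbitrary) that recovers the first principal component of a matrix with roughly 10% of its entries deleted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "expression matrix": 50 samples x 20 variables with one planted
# component, noise, and ~10% of the entries deleted at random.
t_true = rng.standard_normal(50)
p_true = rng.standard_normal(20)
X = np.outer(t_true, p_true) + 0.1 * rng.standard_normal((50, 20))
X[rng.random(X.shape) < 0.1] = np.nan

def nipals_pc1(X, n_iter=200):
    """First principal component by NIPALS, using observed cells only."""
    obs = ~np.isnan(X)              # mask of observed entries
    Z = np.where(obs, X, 0.0)       # missing cells contribute nothing to sums
    t = Z[:, 0].copy()
    for _ in range(n_iter):
        p = (Z.T @ t) / (obs.T @ t**2)   # loadings: per-variable regressions
        p /= np.linalg.norm(p)
        t = (Z @ p) / (obs @ p**2)       # scores: per-sample regressions
    return t, p

t, p = nipals_pc1(X)
# The recovered loading direction should line up with the planted one.
cos = abs(p @ p_true) / np.linalg.norm(p_true)
print(round(cos, 3))
```

Despite the missing cells, the recovered loading vector is nearly collinear with the planted one, which is the property the article relies on for incomplete omics data.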
SUMMARY A fast and memory-saving PLS regression algorithm for matrices with large numbers of objects is presented. It is called the kernel algorithm for PLS. Long (meaning having many objects, N) matrices X (N x K) and Y (N x M) are condensed into a small (K x K) square 'kernel' matrix XᵀYYᵀX of size equal to the number of X-variables. Using this kernel matrix XᵀYYᵀX together with the small covariance matrices XᵀX and XᵀY, all parameters of the PLS model can be calculated, following the classical PLS algorithm. As appendices, a condensed matrix algebra version of the kernel algorithm is given together with the MATLAB code. NIPALS is a robust procedure for solving eigenvector-eigenvalue-related problems in which the eigenvectors (components, factors) are calculated in a partial fashion, one at a time, until all systematic variance in the data structure is explained. For each new dimension the information explained by the last component is subtracted from the data matrices X and Y to create residuals, on which subsequent dimensions are calculated by the same procedure. In the classical PLS algorithm this sequential calculation of PLS dimensions is done iteratively. In general, iterative algorithms often become inefficient when the data structure to be modelled is large, and the classical PLS algorithm is no exception.
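The condensation step can be illustrated in a few lines. The paper's own implementation is in MATLAB; the following is an illustrative NumPy sketch on simulated data, using the fact that the first PLS weight vector is the dominant eigenvector of the small K x K kernel XᵀYYᵀX, so it can be computed without revisiting the N object rows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 2000, 5, 2              # many objects, few variables
X = rng.standard_normal((N, K))
Y = X @ rng.standard_normal((K, M)) + 0.1 * rng.standard_normal((N, M))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

# Condense the long matrices once into small matrices.
XtY = X.T @ Y                     # (K x M)
kernel = XtY @ XtY.T              # X'YY'X, (K x K); the N rows are gone

# First PLS weight vector w1: eigenvector of the kernel with the largest
# eigenvalue (the kernel is symmetric, so eigh applies).
eigvals, eigvecs = np.linalg.eigh(kernel)
w1 = eigvecs[:, -1]

# Cross-check: classical NIPALS on the full N-row matrices converges to
# the same direction (each loop is one power iteration of X'YY'X).
u = Y[:, 0].copy()
for _ in range(500):
    w = X.T @ u
    w /= np.linalg.norm(w)
    t = X @ w
    c = Y.T @ t / (t @ t)
    u = Y @ c
print(np.allclose(np.abs(w), np.abs(w1), atol=1e-6))
```

Only the first component is shown here; the full kernel algorithm proceeds by deflating the small matrices for subsequent dimensions rather than the N-row data.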
SUMMARY A fast PLS regression algorithm for data matrices with many variables (K) and fewer objects (N) is presented. For such data matrices the classical algorithm is computer-intensive and memory-demanding. Recently, Lindgren et al. (J. Chemometrics, 7, 45-49 (1993)) developed a quick and efficient kernel algorithm for the case with many objects and few variables. The present paper focuses on the opposite case, i.e. many variables and fewer objects. A kernel algorithm is presented based on the eigenvectors of the 'kernel' matrix XXᵀYYᵀ, which is a square, non-symmetric matrix of size N x N, where N is the number of objects. Using the kernel matrix and the association matrices XXᵀ (N x N) and YYᵀ (N x N), it is possible to calculate all score and loading vectors and hence conduct a complete PLS regression, including diagnostics such as R². This is done without returning to the original data matrices X and Y. The algorithm is presented in equation form, with proofs of some new properties, and as MATLAB code.
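The wide-matrix case can be sketched in the same spirit. The paper supplies MATLAB code; this is an illustrative NumPy sketch on simulated data, using the fact that the first X-score vector is the dominant eigenvector of the N x N kernel XXᵀYYᵀ, so the K-column matrices need not be revisited:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, M = 30, 5000, 2            # few objects, many variables
X = rng.standard_normal((N, K))
Y = X[:, :3] @ rng.standard_normal((3, M)) + 0.1 * rng.standard_normal((N, M))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

# Small N x N association matrices replace the wide X and Y.
XXt = X @ X.T
YYt = Y @ Y.T
kernel = XXt @ YYt               # square but non-symmetric, N x N

# First X-score vector t1: eigenvector of XX'YY' with the largest
# eigenvalue (real and non-negative, since XX' and YY' are PSD).
eigvals, eigvecs = np.linalg.eig(kernel)
t1 = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
t1 /= np.linalg.norm(t1)

# Cross-check: NIPALS on the full K-column matrices gives the same score
# (each loop is one power iteration of XX'YY' applied to t).
u = Y[:, 0].copy()
for _ in range(500):
    w = X.T @ u
    w /= np.linalg.norm(w)
    t = X @ w
    t /= np.linalg.norm(t)       # normalised for comparison
    c = Y.T @ t
    u = Y @ c
print(np.allclose(np.abs(t), np.abs(t1), atol=1e-6))
```

As in the long-matrix case, only the first dimension is sketched; later dimensions come from deflating the N x N matrices.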
Regression model validation by permutation tests was explored. Especially in cases where the model significance is doubtful, a permutation test adds crucial information which can often be decisive for the existence of the model. The background and applicability of the test procedure are described. As an example, the use of permutation tests was extended to validation and investigation of four predictor variable selection techniques, namely MUSEUM, GOLPE, VIP and IVS-PLS. The selection methods are briefly reviewed and compared. The permutation tests were applied before, during and after variable selection. Some similarities and differences in the behaviour of the variable selection techniques were found and are commented upon.
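The basic permutation test described above can be sketched as follows: the response is shuffled many times, the model is refitted to each shuffled response, and the real model's R² is compared with the distribution of chance R² values. This is an illustrative NumPy example with ordinary least squares standing in for the regression method; data sizes and the permutation count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 5
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.5 * rng.standard_normal(n)

def r2(X, y):
    """R^2 of an ordinary least-squares fit (stand-in for any regression)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss

r2_real = r2(X, y)

# Permutation test: shuffling y destroys any real X-y relation, so
# refitting on shuffled responses shows what R^2 pure chance produces.
n_perm = 200
r2_perm = np.array([r2(X, rng.permutation(y)) for _ in range(n_perm)])
p_value = (1 + np.sum(r2_perm >= r2_real)) / (n_perm + 1)
print(f"R2 = {r2_real:.3f}, permutation p = {p_value:.4f}")
```

A model whose real R² sits inside the permutation distribution is indistinguishable from chance; this is the "decisive information" the test contributes, before or after variable selection.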
The information content of previously published peptide sets was compared with that of smaller sets of peptides selected according to statistical designs. It was found that minimum analogue peptide sets (MAPS), constructed by factorial or fractional factorial designs in physicochemical properties, contained substantial structure-activity information. Although five to six times smaller than the originally published peptide sets, the MAPS resulted in QSAR models able to predict biological activity. The QSARs derived from a MAPS of nine dipeptides and from a set of 58 dipeptides inhibiting angiotensin converting enzyme were compared and found to be of equal strength. Furthermore, for a set of bitter-tasting dipeptides it was found that an incomplete MAPS of 10 dipeptides gave just as good a model as the model based on a set of 48 dipeptides. By comparison, other non-designed sets of peptides gave QSARs with poor predictive power. It was also demonstrated how MAPS centered on a lead peptide can be constructed so as to specifically explore the physicochemical and biological properties in the vicinity of the lead. It was concluded that small, information-rich peptide sets (MAPS) can be constructed on the basis of statistical designs with principal properties of amino acids as design variables.
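The design idea can be sketched in miniature. Assuming two hypothetical, standardised principal properties per amino acid (the numbers below are illustrative, not any published scale), a 2² factorial MAPS picks the amino acid closest to each corner of the property space; for dipeptides the same design would be applied to both positions:

```python
from itertools import product

import numpy as np

# Hypothetical standardised principal properties per amino acid, e.g.
# (hydrophobicity, size). Illustrative values, not a measured scale.
props = {
    "A": (-0.5, -1.0), "F": (1.2, 0.8), "K": (-1.5, 0.6),
    "L": (1.0, 0.2), "S": (-0.8, -0.9), "W": (1.4, 1.5),
    "G": (-0.4, -1.6), "R": (-1.8, 1.0),
}
aas = sorted(props)

# Full 2^2 factorial design: each corner asks for a low/high combination
# of the two properties.
corners = list(product([-1.0, 1.0], repeat=2))

def nearest(corner):
    """Amino acid whose property vector lies closest to a design corner."""
    return min(aas, key=lambda a: np.hypot(props[a][0] - corner[0],
                                           props[a][1] - corner[1]))

maps_set = [nearest(c) for c in corners]
print(maps_set)
```

The four selected residues spread over the extremes of both properties, which is what makes the resulting small set information-rich for QSAR modelling.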
The size of a combinatorial library can be reduced in two ways: by basing the selection on the building blocks (BBs) or by basing it on the full set of virtually constructed products. In this paper we have investigated the effects of applying statistical designs to BB sets compared to selections based on the final products. The two sets of BBs and the virtually constructed library were described by structural parameters, and the correlation between the two characterizations was investigated. Three different selection approaches were used both for the BB sets and for the products. In the first two the selection algorithms were applied directly to the data sets (D-optimal design and space-filling design), while in the third a cluster analysis preceded the selection (cluster-based design). The selections were compared using visual inspection, the Tanimoto coefficient, the Euclidean distance, the condition number, and the determinant of the resulting data matrix. No difference in efficiency was found between selections made in the BB space and in the product space. However, it is of critical importance to investigate the BB space carefully and to select an appropriate number of BBs to achieve adequate diversity. An example from the pharmaceutical industry is then presented, where selection via BBs was made using a cluster-based design.
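A cluster-based design of the kind used in the industrial example can be sketched as follows: cluster the building-block descriptor matrix, then keep the building block nearest each cluster centre. This is illustrative NumPy code on simulated descriptors; the descriptor layout, family structure, and cluster count are assumptions, and plain Lloyd's k-means stands in for whatever clustering method was actually used:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical descriptor matrix for 60 building blocks (three structural
# families, three descriptors each); purely simulated, for illustration.
families = np.array([[0.0, 0.0, 0.0], [4.0, 4.0, 0.0], [0.0, 4.0, 4.0]])
bbs = np.vstack([f + rng.standard_normal((20, 3)) for f in families])

def kmeans(data, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: returns centroids and cluster labels."""
    r = np.random.default_rng(seed)
    cent = data[r.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(data[:, None, :] - cent[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        cent = np.array([data[labels == j].mean(axis=0)
                         if np.any(labels == j) else cent[j]  # keep empty ones
                         for j in range(k)])
    return cent, labels

cent, labels = kmeans(bbs, k=3)

# Cluster-based design: keep the building block nearest each centroid,
# shrinking 60 candidates to a diverse subset of 3.
picks = [int(np.argmin(np.linalg.norm(bbs - c, axis=1))) for c in cent]
print(sorted(picks))
```

Running the selection in BB space like this, rather than on the virtual products, is exactly the trade-off the paper evaluates; the finding above is that it loses no efficiency provided the BB space is characterised carefully.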