A new procedure is presented that can substantially improve the predictive ability of multivariate calibration models while using only a small number of interpretable variables. The core of the methodology is to sort the variables according to an informative vector and then systematically investigate partial least squares (PLS) regression models, comparing their cross-validation parameters to find the most relevant set of variables. In this work, seven main informative vectors, i.e., the regression vector, correlation vector, residual vector, variable influence on projection (VIP), net analyte signal (NAS), covariance procedures vector (CovProc) and signal-to-noise ratio vector (StN), together with their combinations, were automated and tested for feature selection. Six data sets from different sources were employed to validate the methodology: near-infrared (NIR) spectroscopy, Raman spectroscopy, gas chromatography (GC), fluorescence spectroscopy, quantitative structure-activity relationships (QSAR) and computer simulation. The results indicate that all vectors and their combinations were able to enhance prediction capability with respect to the full data sets. However, for most of the tested data sets, the regression and NAS informative vectors from PLS regression, both built using more latent variables than the final model, were the best informative vectors for variable selection. In all applications, the selected variables were effective and useful for interpretation.
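The sort-then-search idea described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the regression vector as the only informative vector, a plain NIPALS PLS1 fit, leave-one-out RMSECV as the comparison criterion, and a window that grows one variable at a time. The function names (`pls1_coef`, `rmsecv`, `ops_select`) and all parameter defaults are hypothetical.

```python
import numpy as np

def pls1_coef(X, y, n_lv):
    """Regression vector of a PLS1 model (NIPALS) for mean-centered X and y."""
    X, y = np.array(X, dtype=float), np.array(y, dtype=float)  # working copies
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = X.T @ y
        w /= np.linalg.norm(w)      # weight vector
        t = X @ w                   # scores
        tt = t @ t
        p = X.T @ t / tt            # X loadings
        c = y @ t / tt              # y loading
        X -= np.outer(t, p)         # deflate X
        y -= c * t                  # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def rmsecv(X, y, n_lv):
    """Leave-one-out root mean square error of cross-validation."""
    n = len(y)
    errs = []
    for i in range(n):
        keep = np.arange(n) != i
        xm, ym = X[keep].mean(axis=0), y[keep].mean()
        b = pls1_coef(X[keep] - xm, y[keep] - ym, n_lv)
        errs.append(y[i] - (ym + (X[i] - xm) @ b))
    return float(np.sqrt(np.mean(np.square(errs))))

def ops_select(X, y, n_lv_vector=5, n_lv_model=2, start=3, step=1):
    """Sort variables by |regression vector|, then search growing subsets."""
    xm, ym = X.mean(axis=0), y.mean()
    # Informative vector built with more latent variables than the model itself
    info = np.abs(pls1_coef(X - xm, y - ym, n_lv_vector))
    order = np.argsort(info)[::-1]          # most informative first
    best_err, best_vars = np.inf, None
    for k in range(start, X.shape[1] + 1, step):
        cols = order[:k]
        err = rmsecv(X[:, cols], y, min(n_lv_model, k))
        if err < best_err:
            best_err, best_vars = err, cols
    return best_err, best_vars

# Simulated data (an assumption): only the first three variables carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 15))
y = 5 * X[:, 0] - 4 * X[:, 1] + 3 * X[:, 2] + 0.1 * rng.normal(size=60)
best_err, best_vars = ops_select(X, y)
full_err = rmsecv(X, y, 2)
print(sorted(best_vars.tolist()), round(best_err, 3), round(full_err, 3))
```

Swapping in another informative vector (correlation, VIP, NAS and so on) would only change how `info` is computed; the sorting and subset search stay the same.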
A novel 4D-QSAR approach which makes use of the molecular dynamics (MD) trajectories and topology information retrieved from the GROMACS package is presented in this study. This new methodology, named LQTA-QSAR (LQTA, Laboratório de Quimiometria Teórica e Aplicada), has a module (LQTAgrid) that calculates intermolecular interaction energies at each grid point considering probes and all aligned conformations resulting from MD simulations. These interaction energies are the independent variables or descriptors employed in a QSAR analysis. The proposed methodology was compared with other 4D-QSAR and CoMFA formalisms using a set of forty-seven glycogen phosphorylase b inhibitors (data set 1) and a set of forty-four MAP p38 kinase inhibitors (data set 2). The QSAR models for both data sets were built using the ordered predictor selection (OPS) algorithm for variable selection. Model validation was carried out by applying y-randomization and leave-N-out cross-validation in addition to external validation. PLS models for data sets 1 and 2 provided the following statistics: q² = 0.72, r² = 0.81 for 12 selected variables and 2 latent variables, and q² = 0.82, r² = 0.90 for 10 selected variables and 5 latent variables, respectively. Visualization of the descriptors in 3D space was successfully interpreted from the chemical point of view, supporting the applicability of this new approach in rational drug design.
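The two internal validation checks named above, y-randomization and leave-N-out cross-validation, can be illustrated with a short sketch. To keep it self-contained, ordinary least squares stands in for the PLS models used in the study, and the data are simulated; the function names and defaults are hypothetical.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Least-squares fit with intercept (a stand-in for PLS in this sketch)."""
    A = np.column_stack([np.ones(len(y_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def r2_fit(X, y):
    """Coefficient of determination of the fitted model."""
    resid = y - fit_predict(X, y, X)
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_perm=50, seed=0):
    """Refit after shuffling y; a sound model should beat every shuffled fit."""
    rng = np.random.default_rng(seed)
    r2_perm = np.array([r2_fit(X, rng.permutation(y)) for _ in range(n_perm)])
    return r2_fit(X, y), r2_perm

def q2_leave_n_out(X, y, n_out, seed=0):
    """q2 = 1 - PRESS / SS_total, leaving out n_out samples at a time."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    press = 0.0
    for start in range(0, len(y), n_out):
        test = idx[start:start + n_out]
        train = np.setdiff1d(idx, test)
        pred = fit_predict(X[train], y[train], X[test])
        press += np.sum((y[test] - pred) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Simulated descriptors and activities (an assumption, not data from the study)
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 1.0]) + 0.1 * rng.normal(size=40)

r2_true, r2_perm = y_randomization(X, y)
q2 = q2_leave_n_out(X, y, n_out=5)
print(round(r2_true, 3), round(r2_perm.max(), 3), round(q2, 3))
```

A model obtained by chance would show shuffled r² values close to the real one, and a model that is not robust would show q² dropping sharply as `n_out` grows.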
Received 4/6/12; accepted 15/11/12; published on the web 12/3/13. QSAR MODELING: A NEW OPEN SOURCE COMPUTATIONAL PACKAGE TO GENERATE AND VALIDATE QSAR MODELS. QSAR modeling is a novel computer program developed to generate and validate QSAR or QSPR (quantitative structure-activity or property relationships) models. With QSAR modeling, users can build partial least squares (PLS) regression models, perform variable selection with the ordered predictors selection (OPS) algorithm, and validate models by using y-randomization and leave-N-out cross-validation. An additional new feature is outlier detection carried out by simultaneous comparison of sample leverage with the respective Studentized residuals. The program was developed using Java version 6, and runs on any operating system that supports Java Runtime Environment version 6. The use of the program is illustrated. This program is available for download at lqta.iqm.unicamp.br. Keywords: QSAR models; OPS variable selection; outlier detection.
INTRODUCTION. The study of quantitative relationships between chemical structure and biological activity or some physicochemical property (QSAR/QSPR) is a prominent area in the scientific community today. For example, in physical chemistry, QSPR studies are essential for predicting properties that are difficult to measure experimentally. In theoretical medicinal chemistry, predicting the biological activity of new compounds using mathematical relationships based on the structural, physicochemical and conformational properties of previously tested potential agents is an extremely active and promising research field. QSAR relationships are useful for understanding and explaining the mechanism of action of drugs at the molecular level, and they enable the design and development of new compounds with desirable biological properties. [1] A quantitative QSAR (or QSPR) model is represented by a statistically significant mathematical equation that relates the properties of the investigated compounds to their biological activities. This equation must not only have good predictive power, but must also be validated to show that it is robust and was not obtained by chance. [2-7] Several programs available in the literature can be used to generate QSAR models. [17] Table 1 compares the main features of the QSAR modeling program with those of the aforementioned programs. Notably, among the free programs, only QSAR modeling incorporates all the tests suggested in the literature for validation [3] and for obtaining robust models that do not arise from spurious correlations, together with a critical evaluation of compounds with atypical behavior. In this work, a new open source program named QSAR modeling is presented, whose purpose is to build and validate QSAR models using chemometric tools. It is the first program to implement the recently developed variable selection method ordered predictors selection (OPS), [18] and it incorporates the processes of vali...
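The outlier check mentioned in this entry, comparing each sample's leverage with its Studentized residual at the same time, can be sketched as below. This is only an illustration under stated assumptions: the leverage comes from an ordinary least-squares hat matrix rather than from PLS scores, the residuals are internally Studentized, and the cutoffs (leverage above 3p/n, absolute Studentized residual above 2) are common rules of thumb, not values taken from the program. All names are hypothetical.

```python
import numpy as np

def leverage_and_studentized(T, y):
    """Leverage h_i and internally Studentized residuals for a linear fit."""
    A = np.column_stack([np.ones(len(y)), T])   # add intercept column
    n, p = A.shape
    H = A @ np.linalg.pinv(A)                   # hat matrix
    h = np.diag(H)                              # sample leverages
    e = y - H @ y                               # raw residuals
    s2 = e @ e / (n - p)                        # residual variance
    r = e / np.sqrt(s2 * (1.0 - h))             # Studentized residuals
    return h, r

def flag_outliers(T, y, r_cut=2.0):
    """Flag samples that exceed BOTH the leverage and the residual cutoff."""
    h, r = leverage_and_studentized(T, y)
    n = len(y)
    p = np.column_stack([np.ones(n), T]).shape[1]
    h_cut = 3.0 * p / n                         # assumed rule-of-thumb cutoff
    return (h > h_cut) & (np.abs(r) > r_cut), h, r

# Simulated data (an assumption): 30 well-behaved samples plus one atypical one
rng = np.random.default_rng(0)
x = np.append(np.linspace(0.0, 1.0, 30), 3.0)   # last sample is far away in x
y = 2.0 * x + 0.05 * rng.normal(size=31)
y[-1] = 0.0                                     # and far from the trend in y

flagged, h, r = flag_outliers(x, y)
print(np.flatnonzero(flagged), round(h[-1], 2), round(r[-1], 2))
```

Requiring both conditions separates samples that are merely unusual in descriptor space (high leverage only) or merely poorly fitted (large residual only) from those that are both influential and badly predicted.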
An evaluation of the computational performance and cross-validation precision of five partial least squares (PLS) algorithms (NIPALS, modified NIPALS, Kernel, SIMPLS and bidiagonal PLS), all available and widely used in the literature, is presented. When dealing with large data sets, computational time is an important issue, mainly in cross-validation and variable selection. In the present paper, the PLS algorithms are compared in terms of run time and of the relative error in precision obtained when performing leave-one-out cross-validation on simulated and real data sets. The simulated data sets were investigated through factorial and Latin square experimental designs. The evaluations were based on the number of rows, the number of columns and the number of latent variables. For both simulated and real data sets, the differences in run time were statistically significant. Bidiagonal PLS is the fastest algorithm, followed by Kernel and SIMPLS. Regarding cross-validation error, all algorithms showed similar results. However, in some situations, for example when many latent variables were used, discrepancies were observed, especially with respect to SIMPLS.
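A comparison of this kind can be illustrated on a small scale. The sketch below times two PLS1 implementations on simulated data and checks that they give the same regression vector: classical NIPALS, and a covariance-based variant in the spirit of kernel algorithms that deflates X'y instead of X. It is not the benchmark from the paper: only two of the five algorithms are covered, the kernel-style variant may differ from the Kernel implementation that was evaluated, and a single fit is timed instead of a full leave-one-out cross-validation. Function names and data sizes are assumptions.

```python
import time
import numpy as np

def pls1_nipals(X, y, n_lv):
    """Classical NIPALS PLS1: deflates the full X matrix at every component."""
    X, y = np.array(X, dtype=float), np.array(y, dtype=float)  # working copies
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w
        tt = t @ t
        p = X.T @ t / tt
        c = y @ t / tt
        X -= np.outer(t, p)
        y -= c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def pls1_kernel(X, y, n_lv):
    """Covariance-based PLS1: deflates only the vector X'y, never X itself."""
    XY = X.T @ y
    n_vars = X.shape[1]
    R = np.zeros((n_vars, n_lv))
    P = np.zeros((n_vars, n_lv))
    q = np.zeros(n_lv)
    for a in range(n_lv):
        w = XY / np.linalg.norm(XY)
        r = w.copy()
        for j in range(a):                      # r gives scores directly from X
            r -= (P[:, j] @ w) * R[:, j]
        t = X @ r
        tt = t @ t
        P[:, a] = X.T @ t / tt
        q[a] = (r @ XY) / tt
        XY = XY - P[:, a] * q[a] * tt           # deflate the covariance vector
        R[:, a] = r
    return R @ q

def best_time(fn, X, y, n_lv, repeats=5):
    """Smallest wall-clock time over several repeats, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(X, y, n_lv)
        best = min(best, time.perf_counter() - t0)
    return best

# Simulated, mean-centered data (sizes are arbitrary assumptions)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))
y = X @ rng.normal(size=100) + 0.1 * rng.normal(size=300)
X, y = X - X.mean(axis=0), y - y.mean()

b_nipals = pls1_nipals(X, y, 5)
b_kernel = pls1_kernel(X, y, 5)
t_nipals = best_time(pls1_nipals, X, y, 5)
t_kernel = best_time(pls1_kernel, X, y, 5)
print(np.allclose(b_nipals, b_kernel), t_nipals, t_kernel)
```

Run times depend on the machine and on the numbers of rows, columns and latent variables, which is why the study varied all three factors through experimental designs.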
A web-based application was developed to generate 4D-QSAR descriptors using the LQTA-QSAR methodology, which is based on molecular dynamics (MD) trajectories and topology information retrieved from the GROMACS package. The LQTAGrid module calculates the intermolecular interaction energies at each grid point, considering probes and all aligned conformations resulting from MD simulations. These interaction energies are the independent variables or descriptors employed in a QSAR analysis. A user-friendly front-end web interface, built using the Django framework and the Python programming language, integrates all steps of the LQTA-QSAR methodology in a way that is transparent to the user. On the back end, GROMACS and LQTAGrid are executed to generate the 4D-QSAR descriptors that are later used to build QSAR models. © 2018 Wiley Periodicals, Inc.