A Novel Automated Framework for QSAR Modeling of Highly Imbalanced Leishmania High-Throughput Screening Data

Casanova-Alvarez, Omar; Helguera, Aliuska Morales; Cabrera‐Pérez, Miguel Ángel; Ruiz, Reinaldo Molina; Molina, Christophe

doi:10.1021/acs.jcim.0c01439

Cited by 11 publications

(11 citation statements)

References 52 publications

(90 reference statements)

“…To determine the most relevant variables for the classification of SCAMs based on a DT strategy, the same variable selection algorithm as the one published in the paper by Casanova-Alvarez et al. was implemented: a selection of variables by permutation using a decision tree algorithm combined with a recursive selection of least correlated variables. This algorithm is encapsulated in a component named “variable selection by decision tree”.…”

Section: Methods (mentioning)

confidence: 99%
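
The "recursive selection of least correlated variables" mentioned in the statement above can be illustrated with a minimal sketch. This is not the KNIME component from the cited papers: the function names, the 0.9 correlation cutoff, and the toy descriptor columns are all assumptions made for illustration, and the importance scores are taken as given (in the cited workflow they would come from a permutation step on a decision tree, which is not shown here).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def select_decorrelated(columns, importance, max_abs_corr=0.9):
    """Greedy pass: visit variables from most to least important and keep
    one only if it is weakly correlated with every variable kept so far.

    `max_abs_corr` is a hypothetical cutoff, not a value from the papers.
    """
    ranked = sorted(columns, key=lambda name: importance[name], reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[k])) <= max_abs_corr
               for k in kept):
            kept.append(name)
    return kept


# Toy data (hypothetical): "logp_copy" is "logp" rescaled, "tpsa" is unrelated.
columns = {
    "logp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "logp_copy": [2.0, 4.0, 6.0, 8.0, 10.0],
    "tpsa": [3.0, 1.0, 4.0, 1.0, 5.0],
}
importance = {"logp": 0.40, "logp_copy": 0.35, "tpsa": 0.10}

selected = select_decorrelated(columns, importance)
print(selected)
```

In this toy case the rescaled copy is dropped because it is perfectly correlated with a more important variable, while the unrelated variable is kept despite its lower importance.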

“…Strata can be gathered based on both the partial level and incremental level of statistic performance. Our strategy is an extension from 1D to 2D stratification of work recently published by Casanova-Alvarez and coauthors and an evolution in terms of classification and recursive variable selection of work published by Sheridan. ISE is a machine learning model agnostic and thus can be used in combination to any classification base model as illustrated in this study. ISE predictive models are implemented using the KNIME open-source software, and their workflows for prediction with full data are available in open-access…”

Section: Introduction (mentioning)

confidence: 99%

Isometric Stratified Ensembles: A Partial and Incremental Adaptive Applicability Domain and Consensus-Based Classification Strategy for Highly Imbalanced Data Sets with Application to Colloidal Aggregation

Molina, Ait-Ouarab, Minoux. 2022. J. Chem. Inf. Model.

Self Cite

Partial and incremental stratification analysis of a quantitative structure-interference relationship (QSIR) is a novel strategy intended to categorize classification provided by machine learning techniques. It is based on a 2D mapping of classification statistics onto two categorical axes: the degree of consensus and level of applicability domain. An internal cross-validation set allows to determine the statistical performance of the ensemble at every 2D map stratum and hence to define isometric local performance regions with the aim of better hit ranking and selection. During training, isometric stratified ensembles (ISE) applies a recursive decorrelated variable selection and considers the cardinal ratio of classes to balance training sets and thus avoid bias due to possible class imbalance. To exemplify the interest of this strategy, three different highly imbalanced PubChem pairs of AmpC β-lactamase and cruzain inhibition assay campaigns of colloidal aggregators and complementary aggregators data set available at the AGGREGATOR ADVISOR predictor web page were employed. Statistics obtained using this new strategy show outperforming results compared to former published tools, with and without a classical applicability domain. ISE performance on classifying colloidal aggregators shows from a global AUC of 0.82, when the whole test data set is considered, up to a maximum AUC of 0.88, when its highest confidence isometric stratum is retained.
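
The 2D mapping described in the abstract above (degree of consensus on one axis, applicability domain level on the other) can be sketched as a simple grouping of validation predictions. This is only an illustration under stated assumptions: the level names ("high"/"low", "in"/"out"), the record layout, and the use of plain accuracy as the per-stratum statistic are hypothetical choices, whereas the abstract reports AUC and does not specify these details.

```python
from collections import defaultdict


def stratum_accuracy(records):
    """Map each (consensus, domain) stratum to (accuracy, count).

    Each record is (consensus_level, domain_level, true_label, predicted_label).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for consensus, domain, truth, predicted in records:
        key = (consensus, domain)
        totals[key] += 1
        hits[key] += int(truth == predicted)
    return {key: (hits[key] / totals[key], totals[key]) for key in totals}


def confident_strata(stats, min_accuracy, min_count=1):
    """Return, in sorted order, the strata that meet the accuracy threshold."""
    return sorted(
        key for key, (accuracy, count) in stats.items()
        if accuracy >= min_accuracy and count >= min_count
    )


# Hypothetical validation predictions on a 2 x 2 map.
records = [
    ("high", "in", 1, 1), ("high", "in", 0, 0),
    ("high", "in", 1, 1), ("high", "in", 0, 0),
    ("high", "out", 1, 0), ("high", "out", 0, 0),
    ("low", "in", 1, 0), ("low", "in", 0, 0),
    ("low", "in", 1, 1), ("low", "in", 0, 1),
    ("low", "out", 1, 0), ("low", "out", 0, 1),
]

stats = stratum_accuracy(records)
retained = confident_strata(stats, min_accuracy=0.8)
print(retained)
```

Strata with similar validation performance could then be merged into regions, and only predictions falling in the retained regions would be used for hit ranking.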

“…While there are several studies investigating resampling in the context of bioassay modelling [5, 28–30], changing the training objective has not been thoroughly investigated thus far. This study directly addresses this gap by investigating the effectiveness of a variety of recently published imbalance-insensitive loss functions for training Gradient Boosting classifiers.…”

Section: Introduction (mentioning)

confidence: 99%

Tuning gradient boosting for imbalanced bioassay modelling with custom loss functions

et al. 2022

While in the last years there has been a dramatic increase in the number of available bioassay datasets, many of them suffer from extremely imbalanced distribution between active and inactive compounds. Thus, there is an urgent need for novel approaches to tackle class imbalance in drug discovery. Inspired by recent advances in computer vision, we investigated a panel of alternative loss functions for imbalanced classification in the context of Gradient Boosting and benchmarked them on six datasets from public and proprietary sources, for a total of 42 tasks and 2 million compounds. Our findings show that with these modifications, we achieve statistically significant improvements over the conventional cross-entropy loss function on five out of six datasets. Furthermore, by employing these bespoke loss functions we are able to push Gradient Boosting to match or outperform a wide variety of previously reported classifiers and neural networks. We also investigate the impact of changing the loss function on training time and find that it increases convergence speed up to 8 times faster. As such, these results show that tuning the loss function for Gradient Boosting is a straightforward and computationally efficient method to achieve state-of-the-art performance on imbalanced bioassay datasets without compromising on interpretability and scalability.
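
To make concrete what "changing the loss function" means for gradient boosting, the sketch below computes the per-sample gradient and Hessian of a class-weighted cross-entropy with respect to the raw (pre-sigmoid) score, which is the pair of quantities a second-order boosting objective generally needs. The abstract does not name the specific losses the authors benchmarked, so weighted cross-entropy is used here purely as a simple stand-in; the function name and the `pos_weight` parameter are assumptions for illustration.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def weighted_logloss_grad_hess(raw_scores, labels, pos_weight):
    """Gradient and Hessian of class-weighted cross-entropy per sample.

    Loss for one sample with p = sigmoid(z):
        L = -(pos_weight * y * log(p) + (1 - y) * log(1 - p))
    Derivatives with respect to the raw score z:
        dL/dz   = w * (p - y)
        d2L/dz2 = w * p * (1 - p)
    where w = pos_weight for actives (y = 1) and 1 for inactives (y = 0).
    """
    grads, hessians = [], []
    for z, y in zip(raw_scores, labels):
        p = sigmoid(z)
        weight = pos_weight if y == 1 else 1.0
        grads.append(weight * (p - y))
        hessians.append(weight * p * (1.0 - p))
    return grads, hessians


# One active and one inactive compound, both scored at z = 0 (p = 0.5).
grads, hessians = weighted_logloss_grad_hess([0.0, 0.0], [1, 0], pos_weight=3.0)
print(grads, hessians)
```

With `pos_weight` above 1, a misclassified active contributes a larger gradient than a misclassified inactive, which is one simple way of making training less sensitive to the rarity of the active class.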

“…In another contribution, an automated workflow was created to build a classification-based model for diverse and imbalanced data sets [38]. This workflow was tested using a data set composed of 196 173 compounds, with 1063 compounds displaying antileishmanial activity. Six different methods were tested to build a consensus model, and the model using decision trees had the best performance.…”

mentioning

confidence: 99%
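
The degree of imbalance implied by the figures in the statement above (196 173 compounds, 1063 of them active) can be worked out directly. The last quantity, the number of balanced training subsets obtained by pairing all actives with equally sized chunks of inactives, is an illustrative assumption about one common balancing scheme and is not taken from the cited workflow.

```python
from math import ceil


def imbalance_summary(n_total, n_active):
    """Return basic imbalance figures for a two-class screening data set."""
    n_inactive = n_total - n_active
    return {
        "n_inactive": n_inactive,
        "active_fraction": n_active / n_total,
        "inactive_per_active": n_inactive / n_active,
        # Assumption: one subset per chunk of inactives the size of the active class.
        "balanced_subsets": ceil(n_inactive / n_active),
    }


summary = imbalance_summary(n_total=196_173, n_active=1_063)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Actives make up roughly 0.54% of the data set, that is, about 184 inactive compounds for every active one.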

“…DLCA showed a better performance for the two data sets, compared to other consensus approaches. In another contribution, an automated workflow was created to build a classification-based model for diverse and imbalanced data sets. This workflow was tested using a data set composed of 196 173 compounds, with 1063 compounds displaying antileishmanial activity.…”

mentioning

confidence: 99%

The (Re)-Evolution of Quantitative Structure–Activity Relationship (QSAR) Studies Propelled by the Surge of Machine Learning Methods

Soares, Nunes-Alves, Mazzolari, et al. 2022. J. Chem. Inf. Model.
