Influence of Data Similarity on the Scoring Power of Machine-learning Scoring Functions for Docking

Sze, Kam-Heung; Xiong, Zhiqiang; Ma, Jinlong; Lü, Gang; Chan, Wing Cheong; Li, Hongjian

doi:10.5220/0008873800850092

Cited by 2 publications

(17 citation statements)

References 0 publications


“…Despite the consistency, we advocate the more robust approach by MM-align, first introduced by Sze et al. [ 8 ].…”

Section: Results (mentioning)

confidence: 99%

“…The approach by Sze et al. [ 8 ] was employed to calculate the similarity in terms of protein structure, ligand fingerprint and pocket topology.…”

Section: Methods (mentioning)

confidence: 99%

“…At a specific cutoff, a complex is excluded from the original full training set if its similarity to any of the test complexes is higher than the cutoff. In other words, a complex is included in the training set if its similarity to every test complex is always no greater than the cutoff [ 8 ]. Mathematically, for both protein structure and ligand fingerprint similarities whose values are normalized to [0, 1], a series of new training sets (NTs) were created by gradually removing complexes from the OT according to varying cut-off values given a fixed test set (TS):
\[ NT_{ds}^{s}(c) = \left\{\, p_i \;\middle|\; p_i \in OT \ \text{and}\ \forall q_j \in TS,\ s(p_i, q_j) \le c \,\right\} \]
where $c$ is the cutoff; $p_i$ and $q_j$ represent the $i$th and $j$th complexes from OT and TS, respectively; and $s(p_i, q_j)$ is the similarity between $p_i$ and $q_j$.…”

Section: Methods (mentioning)

confidence: 99%
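
The cutoff-based filtering described in the citation statement above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `new_training_set`, the toy fingerprints, and the use of a Tanimoto coefficient on bit sets as the similarity `s` are all hypothetical stand-ins. The only property taken from the source is that `s` is normalized to [0, 1] and that a training complex is kept only if its similarity to every test complex is no greater than the cutoff `c`.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, TypeVar

C = TypeVar("C")


def new_training_set(
    ot: Iterable[C],
    ts: Iterable[C],
    s: Callable[[C, C], float],
    c: float,
) -> List[C]:
    """Keep each complex p in OT whose similarity to every q in TS is <= c."""
    ts_list = list(ts)  # materialize once so TS can be scanned for every p
    return [p for p in ot if all(s(p, q) <= c for q in ts_list)]


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient on sets of 'on' bits (assumed stand-in similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy fingerprints keyed by complex name.
fingerprints: Dict[str, FrozenSet[int]] = {
    "p1": frozenset({1, 2, 3}),
    "p2": frozenset({1, 2, 3, 4}),
    "p3": frozenset({7, 8}),
    "q1": frozenset({1, 2, 3, 4}),
    "q2": frozenset({5, 6}),
}
original_training = ["p1", "p2", "p3"]  # OT
test_set = ["q1", "q2"]                 # TS


def sim(p: str, q: str) -> float:
    return tanimoto(fingerprints[p], fingerprints[q])


# Raising the cutoff admits progressively more similar complexes,
# so the resulting training sets are nested.
nested = {
    c: new_training_set(original_training, test_set, sim, c)
    for c in (0.5, 0.8, 1.0)
}
for c, nt in nested.items():
    print(c, nt)
```

In this toy example, `p2` has the same fingerprint as `q1` and is therefore admitted only at a cutoff of 1.0, whereas `p3` shares no bits with either test complex and is kept at every cutoff.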

“…proposed a revised definition of structural similarity between a pair of training and test set proteins, introduced a different measure of binding pocket similarity, and benchmarked three classical SFs and four RF-based SFs on CASF-2013. They found that even if the training set was split into two halves and the half with proteins dissimilar to the test set was used for training, RF-based SFs still produced a smaller prediction error than the best classical SF, thus confirming that dissimilar training complexes may be valuable when allied with appropriate ML approaches and informative descriptors [ 8 ].…”

Section: Introduction (mentioning)

confidence: 99%

“…Here we have expanded the above six studies from the following perspectives. Firstly, we will demonstrate three examples to show that the method employed by four early works [ 3–6 ] for calculating structural similarity could be error prone, hence a revised method proposed lately [ 8 ] should be advocated. Secondly, in addition to CASF-2016, a blind evaluation was conducted too, where only data available until 2017 were used to construct the SFs that predict the binding affinities of complexes released by 2018 as if these had not been measured hitherto [ 1 ].…”

Section: Introduction (mentioning)

confidence: 99%


Machine-learning scoring functions trained on complexes dissimilar to the test set already outperform classical counterparts on a blind benchmark

Lü, Sze, et al. 2021

Briefings in Bioinformatics


The superior performance of machine-learning scoring functions for docking has caused a series of debates on whether it is due to learning knowledge from training data that are similar in some sense to the test data. With a systematically revised methodology and a blind benchmark realistically mimicking the process of prospective prediction of binding affinity, we have evaluated three broadly used classical scoring functions and five machine-learning counterparts calibrated with both random forest and extreme gradient boosting using both solo and hybrid features, showing for the first time that machine-learning scoring functions trained exclusively on a proportion of as low as 8% complexes dissimilar to the test set already outperform classical scoring functions, a percentage that is far lower than what has been recently reported on all the three CASF benchmarks. The performance of machine-learning scoring functions is underestimated due to the absence of similar samples in some artificially created training sets that discard the full spectrum of complexes to be found in a prospective environment. Given the inevitability of any degree of similarity contained in a large dataset, the criteria for scoring function selection depend on which one can make the best use of all available materials. Software code and data are provided at https://github.com/cusdulab/MLSF for interested readers to rapidly rebuild the scoring functions and reproduce our results, even to make extended analyses on their own benchmarks.


Modern machine‐learning for binding affinity estimation of protein–ligand complexes: Progress, opportunities, and challenges

Harren, Gutermuth, Grebner, et al. 2024

WIREs Comput Mol Sci


Structure‐based drug design is a widely applied approach in the discovery of new lead compounds for known therapeutic targets. In most structure‐based drug design applications, the docking procedure is considered the crucial step. Here, a potential ligand is fitted into the binding site, and a scoring function assesses its binding capability. With the rise of modern machine‐learning in drug discovery, novel scoring functions using machine‐learning techniques achieved significant performance gains in virtual screening and ligand optimization tasks on retrospective data. However, real‐world applications of these methods are still limited. Missing success stories in prospective applications are one reason for this. Additionally, the fast‐evolving nature of the field makes it challenging to assess the advantages of each individual method. This review will highlight recent strides toward improved real world applicability of machine‐learning based scoring, enabling a better understanding of the potential benefits and pitfalls of these functions on a project. Furthermore, a systematic way of classifying machine‐learning based scoring that facilitates comparisons will be presented.

This article is categorized under: Data Science > Chemoinformatics; Data Science > Artificial Intelligence/Machine Learning; Software > Molecular Modeling
