Active Site Sequence Representations of Human Kinases Outperform Full Sequence Representations for Affinity Prediction and Inhibitor Generation: 3D Effects in a 1D Model

Born, Jannis; Huynh, Tien; Stroobants, Astrid; Cornell, Wendy D.; Manica, Matteo

doi:10.1021/acs.jcim.1c00889

Cited by 20 publications

(36 citation statements)

References 84 publications

“…The experimental setup is largely identical to the binding affinity prediction task described in ref (7). We take data from BindingDB 11 and examine two types of models, a k-nearest-neighbor (KNN) model that builds a joint similarity space of protein and ligand distances and a deep neural network called BiMCA (Bimodal Multiscale Convolutional Attention encoder 12) that ingests protein and ligand sequences (SMILES strings) and consists of convolutional and attention layers.…”

Section: Methods (mentioning)

confidence: 99%

“…In our previous work, 7 the active site representation relied on 29 residues defined originally in Sheridan et al. [ref (8), Table 1]. These residues are short contiguous subsequences that lie discontiguously in the original sequence (cf.…”

Section: Kinase Sequence Representation (mentioning)

confidence: 99%
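
The statement above describes an active site sequence as short contiguous segments that lie apart in the full protein sequence. A minimal sketch of that idea follows, assuming the segments are given as 0-based, half-open index ranges. The function name `extract_active_site`, the toy sequence, and the toy segment positions are all hypothetical and are not the 29 residues or the data from the cited papers.

```python
def extract_active_site(full_sequence, segments):
    """Concatenate discontiguous segments of a protein sequence.

    `segments` is a list of (start, end) pairs, 0-based and half-open,
    each marking one short contiguous stretch of residues.
    """
    # Sort by position so the result follows the order of the full sequence.
    return "".join(full_sequence[start:end] for start, end in sorted(segments))


# Toy data (hypothetical): three 3-residue segments in a 22-residue sequence.
toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"
toy_segments = [(2, 5), (10, 13), (17, 20)]

active_site = extract_active_site(toy_sequence, toy_segments)
print(active_site)
```

The resulting string is much shorter than the full sequence and joins residues that are not neighbors in the original chain.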

“…The superiority of the active site representation manifested consistently across all ligand types, with the sole exception of one drug class: MEK/MAPK inhibitors. 7 Notably, this class contains many allosteric binders, in particular ATP-noncompetitive MAPK inhibitors that bind to a unique site near the ATP-binding pocket. 9 One goal of the presented work is to address this systematic limitation in modeling allosteric binders and refine the definition of an “active site” for binding affinity prediction.…”

Section: Introduction (mentioning)

confidence: 99%

On the Choice of Active Site Sequences for Kinase-Ligand Affinity Prediction

Born, Shoshan, Huynh, et al. 2022

J. Chem. Inf. Model.

Self Cite

Recent work showed that active site rather than full-protein-sequence information improves predictive performance in kinase-ligand binding affinity prediction. To refine the notion of an “active site”, we here propose and compare multiple definitions. We report significant evidence that our novel definition is superior to previous definitions and better models ATP-noncompetitive inhibitors. Moreover, we leverage the discontiguity of the active site sequence to motivate novel protein-sequence augmentation strategies and find that combining them further improves performance.

“…This is consistent with previous studies, which showed that the use of active site sequences can improve the prediction of the affinity. 68, 69 For the occurrence of amino acids in the protein active site, it can be seen that His, Gly, Tyr, and Trp rank in the top position (Figure S3), probably because they can form hydrogen bonds with ligands and thus play key roles in ligand binding. Leu and Phe also rank quite high, possibly due to their contribution to the formation of hydrophobic pockets.…”

Section: Results (mentioning)

confidence: 99%

XLPFE: A Simple and Effective Machine Learning Scoring Function for Protein–Ligand Scoring and Ranking

Dong, Wang 2022

ACS Omega

Prediction of protein–ligand binding affinities is a central issue in structure-based computer-aided drug design. In recent years, much effort has been devoted to the prediction of the binding affinity in protein–ligand complexes using machine learning (ML). Due to the remarkable ability of ML methods in nonlinear fitting, ML-based scoring functions (SFs) can deliver much improved performance on a selected test set, such as the comparative assessment of scoring functions (CASF), when compared to the classical SFs. However, the performance of ML-based SFs heavily relies on the overall similarity of the training set and the test set. To improve the performance and transferability of an SF, we have tried to combine various features including energy terms from X-score and AutoDock Vina, the properties of ligands, and the statistical sequence-related information from either the binding site or the full protein. In conjunction with extreme trees (ET), an ML model, we have developed XLPFE, a new SF. Compared with other tested methods such as X-score, AutoDock Vina, ΔvinaXGB, PSH-ML, or CNN-score, XLPFE achieves consistently better scoring and ranking power for various types of protein–ligand complex structures beyond the CASF, suggesting that XLPFE has superior transferability. In particular, XLPFE performs better with metalloenzymes. With its faster speed, improved accuracy, and better transferability, XLPFE could be usefully applied to a diverse range of protein–ligand complexes.

“…All the datasets in this work are derived from BindingDB (Liu et al, 2007): a publicly accessible and regularly updated collection of binding affinity values between proteins considered to be drug targets, and drug-like molecules. In particular, we adopt two benchmark datasets derived from BindingDB, one released by Yingkai Gao et al (2018) and the other as defined by Karimi et al (2019), which have been used for benchmarking recent DTI predictors (Chen et al, 2020; Born et al, 2022). Both benchmark datasets are outlined in Table 1.…”

Section: Methods (mentioning)

confidence: 99%

Improving the Assessment of Deep Learning Models in the Context of Drug-Target Interaction Prediction

Torrisi, Léon, Climent, et al. 2022

Preprint

Machine learning techniques have been widely adopted to predict drug-target interactions, a central area of research in early drug discovery. These techniques have shown promising results on various benchmarks, although they tend to suffer from poor generalization. This is typically related to the very sparse and nonuniform datasets available, which limit the applicability domain of machine learning techniques. Moreover, widespread approaches to split datasets (into training and test sets) treat each drug-target interaction as an independent entity, when in reality the drug and target involved may take part in other interactions, breaking the assumption of independence. We observe that this leads to overly optimistic test results and poor generalization to out-of-distribution samples for various state-of-the-art sequence-based machine learning models for drug-target prediction. We show that previous approaches to reduce bias in binding datasets focus on drug or target information only and, thus, lead to similar pitfalls. Finally, we propose a minimum viable solution to evaluate the generalization capability of a machine learning model based on the systematic separation of test samples with respect to drugs and targets in the training set, thus discerning the three out-of-distribution scenarios seen at test time: (1) drug or (2) target present in the training set, or (3) neither.
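
The separation of test samples described in this abstract can be illustrated with a short sketch. It assumes each sample is a (drug, target) pair of identifiers and labels every test pair by whether its drug and its target occur anywhere in the training set. The function name `label_test_pairs`, the label strings, and the toy pairs are hypothetical; the extra `both_seen` label for in-distribution pairs is an assumption added here, not something the abstract states.

```python
def label_test_pairs(train_pairs, test_pairs):
    """Label each test (drug, target) pair by what the training set has seen."""
    train_drugs = {drug for drug, _ in train_pairs}
    train_targets = {target for _, target in train_pairs}

    labels = []
    for drug, target in test_pairs:
        drug_seen = drug in train_drugs
        target_seen = target in train_targets
        if drug_seen and target_seen:
            labels.append("both_seen")         # assumed in-distribution case
        elif drug_seen:
            labels.append("drug_seen_only")    # scenario (1): new target
        elif target_seen:
            labels.append("target_seen_only")  # scenario (2): new drug
        else:
            labels.append("neither_seen")      # scenario (3)
    return labels


# Toy data (hypothetical identifiers).
train = [("aspirin", "COX1"), ("imatinib", "ABL1")]
test = [
    ("aspirin", "ABL1"),
    ("aspirin", "EGFR"),
    ("gefitinib", "ABL1"),
    ("gefitinib", "EGFR"),
]

labels = label_test_pairs(train, test)
print(list(zip(test, labels)))
```

Reporting a metric separately for each label, rather than one pooled score, is one way to expose the optimistic bias the abstract attributes to pairwise random splits.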
