How to approach machine learning-based prediction of drug/compound–target interactions

Guvenilir, Heval Atas; Doğan, Tunca

doi:10.1186/s13321-023-00689-w

Cited by 14 publications (14 citation statements)

References 88 publications (117 reference statements)


“…ProtBENCH [30] contains protein family‐specific bioactivity data, spanning multiple protein superfamilies, including membrane receptors, ion channels, transporters, transcription factors, epigenetic regulators, and enzymes with five subgroups (i.e., transferases, proteases, hydrolases, oxidoreductases, and other enzymes). The family subsets vary in number of interactions (19K–220K), number of proteins (100–1K), and number of compounds (10K–120K).…”

Section: Methods (mentioning)

confidence: 99%

Exploring data‐driven chemical SMILES tokenization approaches to identify key protein–ligand binding moieties

Temizer, Uludoğan, Özçelik et al. 2024, Molecular Informatics


Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences since molecular sequences are easily available, simple, and informative. The sequence‐based models often segment molecular sequences into pieces called chemical words, analogous to the words that make up sentences in human languages, and then apply advanced natural language processing techniques for tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, chemical words, remain unexplored. To address this gap, we employ data‐driven SMILES tokenization techniques such as Byte Pair Encoding, WordPiece, and Unigram to identify chemical words and compare the resulting vocabularies. To understand the chemical significance of these words, we build a language‐inspired pipeline that treats high affinity ligands of protein targets as documents and selects key chemical words making up those ligands based on tf‐idf weighting. The experiments on multiple protein‐ligand affinity datasets show that despite differences in words, lengths, and validity among the vocabularies generated by different subword tokenization algorithms, the identified key chemical words exhibit similarity. Further, we conduct case studies on a number of targets to analyze the impact of key chemical words on binding. We find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our approach elucidates chemical properties of the words identified by machine learning models and can be used in drug discovery studies to determine significant chemical moieties.
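The tf‐idf step described in this abstract can be illustrated with a minimal sketch. It assumes the ligands have already been tokenized into chemical words and that each target's high affinity ligands are pooled into one document. The function name `key_chemical_words`, the target names, and the hand-made words are hypothetical and are not taken from the cited paper; the scoring uses the plain term frequency times log inverse document frequency form, which may differ from the variant the authors used.

```python
import math
from collections import Counter


def key_chemical_words(documents, top_k=3):
    """Rank chemical words per target by tf-idf.

    documents maps a target name to the list of chemical words obtained by
    tokenizing all of its high-affinity ligands (one "document" per target).
    """
    n_docs = len(documents)

    # Document frequency: number of targets whose ligands contain the word.
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    ranked = {}
    for target, words in documents.items():
        counts = Counter(words)
        total = len(words)
        # tf = relative frequency in this document, idf = log(N / df).
        scores = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked[target] = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    return ranked


# Hypothetical, hand-made chemical words. A real vocabulary would come from
# BPE, WordPiece, or Unigram trained on SMILES strings.
docs = {
    "target_A": ["c1ccccc1", "C(=O)N", "S(=O)(=O)N", "S(=O)(=O)N"],
    "target_B": ["c1ccccc1", "C(=O)O", "C(=O)O", "CCN"],
    "target_C": ["c1ccccc1", "CCN", "C(=O)N"],
}
top = key_chemical_words(docs, top_k=1)
```

In this toy input the benzene ring appears in every document, so its idf is zero and it is never selected, while words concentrated in one target's ligands rank first. That is the behavior the abstract relies on when it reports that key chemical words are target-specific.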



“…The pre-trained model from the DrugBank dataset underwent fine-tuning using the random forest regression method, and the learning rate was selected from the range [1e-5, 1e-4, 4e-4, 1e-3]. Furthermore, different batch sizes, namely [8, 16, 32], were experimented with. To ensure robustness, the five-fold cross-validation technique was utilized.…”

Section: Methods (mentioning)

confidence: 99%
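The search over learning rates and batch sizes with five-fold cross-validation mentioned in this statement can be sketched as a plain grid search. This is an illustration under stated assumptions, not the cited authors' training code: `k_fold_indices`, `grid_search`, and `dummy_evaluate` are hypothetical names, the folds are contiguous and unshuffled, and the scorer is a placeholder standing in for actual model fine-tuning and validation.

```python
from itertools import product

# Candidate values quoted in the citation statement.
LEARNING_RATES = [1e-5, 1e-4, 4e-4, 1e-3]
BATCH_SIZES = [8, 16, 32]


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def grid_search(evaluate, n_samples, k=5):
    """Return the (learning_rate, batch_size) pair with the best mean fold score."""
    best_config, best_score = None, float("-inf")
    for lr, bs in product(LEARNING_RATES, BATCH_SIZES):
        scores = [evaluate(lr, bs, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_config, best_score = (lr, bs), mean_score
    return best_config, best_score


def dummy_evaluate(lr, bs, train_idx, val_idx):
    """Placeholder scorer (hypothetical); a real one would train and validate a model."""
    return -abs(lr - 1e-4) - abs(bs - 16) / 1000


best_config, best_score = grid_search(dummy_evaluate, n_samples=23)
```

Averaging the score over all five folds before comparing configurations is what makes the selection less sensitive to any single train/validation partition.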

“…This approach allowed for the generation of negative samples, resulting in a balanced dataset for analysis. Dissimilar-compound-split dataset: This dataset is based on protein family-specific datasets (Large-scale) [32], further constructed by applying a strategy that only considers compound similarities while distributing bioactivity data points into train-test splits, as presented in Table 2. Compounds in train and test splits are dissimilar (Tanimoto score < 0.5).…”

Section: Dataset and Evaluation Metrics (mentioning)

confidence: 99%
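The constraint that train and test compounds have a Tanimoto score below 0.5 can be sketched as follows. The statement does not say how the benchmark enforces it, so the strategy here is an assumption: a set of test compounds is fixed up front, and any remaining compound that is too similar to one of them is dropped instead of being placed in training. Fingerprints are represented as plain sets of on-bit indices, and `tanimoto`, `dissimilar_split`, and the toy compound ids are hypothetical names.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def dissimilar_split(fingerprints, test_seeds, threshold=0.5):
    """Split compounds so that no train compound is similar to a test compound.

    fingerprints: dict of compound id -> set of on-bit indices.
    test_seeds: compound ids placed in the test split up front.
    Compounds whose similarity to any test compound is >= threshold are
    dropped rather than leaked into training (an assumed policy).
    """
    test = set(test_seeds)
    train, dropped = set(), set()
    for cid, fp in fingerprints.items():
        if cid in test:
            continue
        too_close = any(tanimoto(fp, fingerprints[t]) >= threshold for t in test)
        (dropped if too_close else train).add(cid)
    return train, test, dropped


# Toy fingerprints (hypothetical); real ones would come from a cheminformatics toolkit.
fps = {
    "c1": {1, 2, 3, 4},
    "c2": {1, 2, 3, 5},    # shares 3 of 5 bits with c1 -> similarity 0.6
    "c3": {7, 8, 9},       # no overlap with c1 -> similarity 0.0
    "c4": {4, 10, 11, 12}, # shares 1 of 7 bits with c1 -> about 0.14
}
train, test, dropped = dissimilar_split(fps, test_seeds=["c1"])
```

With `c1` as the only test compound, `c2` is excluded because it exceeds the threshold, and every remaining train/test pair satisfies the stated dissimilarity condition.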

MocFormer: A Two-Stage Pre-training-Driven Transformer for Drug-Target Interactions Prediction

Zhang, Wang, Guan et al. 2023, Preprint


Numerous deep learning (DL) methods have been proposed to identify drug-target interactions (DTIs). However, these methods often face challenges due to the diversity and complexity of drugs and proteins and the presence of noise and bias in the data. Limited labeled data and extracting meaningful features from datasets also pose difficulties. These limitations hinder the development of accurate and general deep-learning models for DTI prediction. To address these challenges, a novel framework is introduced for identifying DTIs. The framework incorporates pre-trained molecular representation models and a transformer module inspired by pre-training. By pre-training the model, it can acquire a more comprehensive feature representation, enabling it to handle the diversity and complexity of drugs and proteins effectively. Moreover, the model mitigates noise and bias in the data by learning general feature representations during pre-training, improving prediction accuracy. In addition to pre-training, a transformer mechanism called MocFormer is proposed. MocFormer extracts feature matrices from drug and protein sequences, obtains decision vectors, and makes predictions based on these decision vectors. Experiments were conducted using public datasets from DrugBank to evaluate the framework's effectiveness. The results demonstrate that the proposed framework outperforms state-of-the-art methods regarding accuracy, area under the ROC curve (AUC), recall, and the area under the precision-recall curve (AUPRC). The code for the framework can be accessed from the following GitHub repository: GitHub Repository.


“…While the ligand-based approach relies on a sufficient number of known ligands for a given protein, the molecular docking approach is limited to available 3D protein structures [8]. Conversely, machine learning-based methods have emerged as a highly promising avenue for predicting DPIs [9]-[12].…”

Section: Introduction (mentioning)

confidence: 99%

Redefining the Game: MVAE-DFDPnet’s Low-Dimensional Embeddings for Superior Drug-Protein Interaction Predictions

Xia, Wu, Zhao et al. 2024, Preprint


Precisely predicting drug-protein interactions (DPIs) is pivotal for drug discovery and advancing precision medicine. A significant challenge in this domain is the high-dimensional and heterogeneous data characterizing drug and protein attributes, along with their intricate interactions. In our study, we introduce a novel deep learning architecture: the Multi-view Variational Auto-Encoder embedded within a cascade Deep Forest (MVAE-DFDPnet). This framework adeptly learns ultra-low-dimensional embeddings for drugs and proteins. Notably, our t-SNE analysis reveals that two-dimensional embedding can clearly define clusters corresponding to diverse drug classes and protein families. These ultra-low-dimensional embeddings likely contribute to the enhanced robustness and generalizability of our MVAE-DFDPnet. Impressively, our model surpasses current leading methods on benchmark datasets, functioning in significantly reduced dimensional spaces. The model's resilience is further evidenced by its sustained accuracy in predicting interactions involving novel drugs, proteins, and drug classes. Additionally, we have corroborated several newly identified DPIs with experimental evidence from the scientific literature. The code used to generate and analyze these results can be accessed from https://github.com/Macau-LYXia/MVAE-DFDPnet-V2 .
