Carbohydrate-active enzymes (CAZymes) are of central importance to the biotechnology industry, particularly the emerging biofuel industry, because CAZymes are responsible for the synthesis, degradation, and modification of all the carbohydrates on Earth. We have developed a web resource, dbCAN (http://csbl.bmb.uga.edu/dbCAN/annotate.php), to provide automated, signature domain-based CAZyme annotation for any protein data set (e.g. proteins from a newly sequenced genome) submitted to our server. To accomplish this, we have explicitly defined a signature domain for every CAZyme family, derived from searches of the CDD (Conserved Domain Database) and from literature curation. We have also constructed a hidden Markov model (HMM) to represent the signature domain of each CAZyme family. These family-specific HMMs are our key contribution and the foundation for the automated CAZyme annotation.
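The family-specific signature idea can be sketched in miniature. The real resource searches profile HMMs with standard HMM software; in this toy sketch, short fixed-length motifs stand in for profile HMMs, and all family names, motifs, and thresholds below are invented for illustration:

```python
# Toy signature-domain annotation: each CAZyme family is represented by one
# short motif (a crude stand-in for a real profile HMM). Family names and
# motifs here are invented, not real dbCAN signatures.

TOY_PROFILES = {
    "GH5": "WDE",   # hypothetical glycoside hydrolase signature motif
    "CE1": "GHS",   # hypothetical carbohydrate esterase signature motif
}

def score_window(window, motif):
    """Count exact position matches between a sequence window and a motif."""
    return sum(a == b for a, b in zip(window, motif))

def annotate(sequence, profiles, min_score=3):
    """Slide each family signature along the sequence; report hits above threshold."""
    hits = []
    for family, motif in profiles.items():
        k = len(motif)
        for i in range(len(sequence) - k + 1):
            s = score_window(sequence[i:i + k], motif)
            if s >= min_score:
                hits.append((family, i, s))
    return hits

# One toy protein that carries both invented signatures:
print(annotate("MKLWDEAAGHS", TOY_PROFILES))  # → [('GH5', 3, 3), ('CE1', 8, 3)]
```

A real pipeline replaces the exact-match scoring with probabilistic HMM alignment and applies per-family E-value cutoffs, but the structure is the same: one signature model per family, scanned over every query protein.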
Predicting protein–ligand interactions with artificial intelligence (AI) models has attracted great interest in recent years. However, data-driven AI models suffer from a lack of sufficiently large and unbiased datasets. Here, we systematically investigated the data biases in the PDBbind and DUD-E datasets. We examined the performance of an atomic convolutional neural network (ACNN) on the PDBbind core set and achieved a Pearson R² of 0.73 between experimental and predicted binding affinities. Strikingly, the ACNN models did not need to learn the essential protein–ligand interactions in complex structures: they achieved similar performance even on datasets containing only ligand structures or only protein structures, while data splitting based on similarity clustering (by protein sequence or ligand scaffold) significantly reduced model performance. We also identified property and topology biases in the DUD-E dataset that artificially inflated the enrichment performance of virtual screening. The property bias in DUD-E was reduced by enforcing more stringent ligand property matching rules, while the topology bias persists because molecular fingerprint similarity is used as a decoy selection criterion. We therefore believe that sufficiently large and unbiased datasets are needed to train robust AI models that accurately predict protein–ligand interactions.
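The similarity-clustered splitting mentioned above can be sketched as follows. Each sample carries a cluster label (e.g. a ligand scaffold or a protein-sequence cluster id), and whole clusters are assigned to either train or test so that near-duplicates cannot leak across the split; the cluster labels and split fraction below are invented for illustration:

```python
import random

def cluster_split(samples, clusters, test_frac=0.25, seed=0):
    """Assign whole clusters to the test set until ~test_frac of samples is reached.

    samples and clusters are parallel lists; clusters[i] is the similarity
    cluster (scaffold or sequence cluster) of samples[i].
    """
    rng = random.Random(seed)
    ids = sorted(set(clusters))
    rng.shuffle(ids)
    test_ids, n_test = set(), 0
    target = test_frac * len(samples)
    for cid in ids:
        if n_test >= target:
            break
        test_ids.add(cid)
        n_test += clusters.count(cid)
    train = [s for s, c in zip(samples, clusters) if c not in test_ids]
    test = [s for s, c in zip(samples, clusters) if c in test_ids]
    return train, test

samples = list(range(8))
clusters = ["a", "a", "a", "b", "b", "c", "c", "d"]  # invented cluster labels
train, test = cluster_split(samples, clusters)

# No cluster appears on both sides of the split:
print(train, test, {clusters[s] for s in train} & {clusters[s] for s in test})
```

Under a plain random split, members of the same cluster land on both sides, so a model can score well by memorizing near-duplicates rather than learning the physical interactions; cluster-level splitting removes that shortcut, which is why performance drops.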
In structure-based virtual screening (SBVS), it is critical that scoring functions capture protein–ligand atomic interactions. By focusing on the local domains of ligand binding pockets, a standardized pocket Pfam-based clustering (Pfam-cluster) approach was developed to assess the cross-target generalization ability of machine-learning scoring functions (MLSFs). Twelve typical MLSFs were then evaluated using random cross-validation (Random-CV), protein sequence similarity-based cross-validation (Seq-CV), and pocket Pfam-based cross-validation (Pfam-CV). Surprisingly, all of the tested models showed decreasing performance from Random-CV to Seq-CV to Pfam-CV, indicating unsatisfactory generalization capacity. Our interpretable analysis suggested that MLSF predictions on novel targets depended on buried solvent-accessible surface area (SASA)-related features of the complex structures, with larger predicted binding affinities for complexes with larger protein–ligand interfaces. By combining buried SASA-related features with target-specific patterns shared only among structurally similar compounds in the same cluster, the random forest (RF)-Score attained good performance in the Random-CV test. Based on these findings, we strongly advise assessing the generalization ability of MLSFs with the Pfam-cluster approach and being cautious about the features learned by MLSFs.
In recent years, large-scale structure-based virtual screening has attracted increasing interest for identifying novel compounds against potential drug targets. Understanding the strengths and weaknesses of docking algorithms is critical to increasing the success rate in practical applications. Here, we systematically investigated the docking successes and failures of two representative docking programs: UCSF DOCK 3.7 and AutoDock Vina. DOCK 3.7 performed better in early enrichment on the Directory of Useful Decoys: Enhanced (DUD-E) data set, although both docking methods were roughly comparable in overall enrichment performance; DOCK 3.7 also showed superior computational efficiency. Intriguingly, the Vina scoring function showed a bias toward compounds with higher molecular weights. Both tested docking approaches produced incorrectly predicted ligand binding poses owing to limitations in torsion sampling. Based on a careful analysis of docking results from six representative cases, we propose reasons underlying the docking failures and provide several solutions, offering practical guidance for large-scale virtual screening campaigns and future docking algorithm development.
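The early-enrichment comparison above rests on the enrichment factor: EF@x% is the fraction of known actives recovered in the top x% of the score-ranked list, divided by the fraction expected from a random ranking. A minimal sketch, with invented docking scores and labels (lower score = better docked pose):

```python
# Enrichment factor for a virtual-screening ranking. Scores and labels below
# are invented for illustration; labels[i] is 1 for an active, 0 for a decoy.

def enrichment_factor(scores, labels, top_frac=0.1):
    """EF at the top `top_frac` of the list; lower score means better rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_top = max(1, int(round(top_frac * len(scores))))
    actives_top = sum(labels[i] for i in order[:n_top])
    total_actives = sum(labels)
    return (actives_top / n_top) / (total_actives / len(scores))

scores = [-12.1, -11.8, -9.5, -9.0, -8.7, -8.2, -7.9, -7.5, -7.1, -6.8]
labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # 3 actives among 10 compounds

# Both actives in the top 20%: EF = (2/2) / (3/10) ≈ 3.33
print(enrichment_factor(scores, labels, top_frac=0.2))
```

An EF of 1.0 means the screen is no better than random at that cutoff; early-enrichment comparisons (e.g. EF at 1% or 0.1%) matter most in practice because only the very top of a large ranked library is ever tested experimentally.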