How to produce expressive molecular representations is a fundamental challenge in artificial intelligence-driven drug discovery. Graph neural networks (GNNs) have emerged as a powerful technique for modeling molecular data. However, previous supervised approaches usually suffer from the scarcity of labeled data and from poor generalization capability. Here, we propose a novel graph-based deep learning framework for molecular pre-training, named MPG, that learns molecular representations from large-scale unlabeled molecules. In MPG, we propose a powerful GNN for modeling molecular graphs, named MolGNet, and design an effective self-supervised strategy for pre-training the model at both the node and graph levels. After pre-training on 11 million unlabeled molecules, we show that MolGNet can capture valuable chemical insights and produce interpretable representations. The pre-trained MolGNet can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of drug discovery tasks, including molecular property prediction, drug-drug interaction prediction, and drug-target interaction prediction, on 14 benchmark datasets. The pre-trained MolGNet in MPG has the potential to become an advanced molecular encoder in the drug discovery pipeline.
Accurate prediction of NMR chemical shifts at affordable computational cost is very important for different types of structural assignments in experimental studies. Density functional theory (DFT) and gauge-including atomic orbital (GIAO) are two of the most popular computational methods for NMR calculation, yet they often fail to resolve ambiguities in structural assignments. Here, we present a new method, DFT + ML, that combines DFT with machine learning (ML) techniques and significantly increases the accuracy of ¹³C/¹H NMR chemical shift prediction for a variety of organic molecules. The input of the generalizable DFT + ML model contains two critical parts: one is a vector providing insights into chemical environments, which can be evaluated without knowing the exact geometry of the molecule; the other is the DFT-calculated isotropic shielding constant. The DFT + ML model was trained with a data set containing 476 ¹³C and 270 ¹H experimental chemical shifts. For the DFT methods used here, the root mean square deviations (RMSDs) between predicted and experimental ¹³C/¹H chemical shifts can be as small as 2.10/0.18 ppm, much lower than those from simple DFT (5.54/0.25 ppm) or DFT + linear regression (LR) (4.77/0.23 ppm) approaches. The model also has a smaller maximum absolute error than two previously proposed NMR-predicting ML models. The robustness of the DFT + ML model was tested on two classes of organic molecules (TIC10 and hyacinthacines), where the correct isomers were unambiguously assigned to the experimental ones. Overall, the DFT + ML model shows promise for structural assignments in a variety of systems, including stereoisomers, that are often challenging to determine experimentally.
An efficient, yet accurate, computational protocol for predicting nitrogen nuclear magnetic resonance (NMR) chemical shifts based on density functional theory and the gauge-including atomic orbital approach is proposed. A database of small and relatively rigid compounds containing nitrogen atoms is compiled. Scaling factors for the linear correlation between experimental ¹⁵N chemical shifts and calculated isotropic shielding constants are systematically investigated with seven different levels of theory in both chloroform and dimethyl sulfoxide (DMSO), two commonly used solvents for NMR experiments. The best method yields a root-mean-square deviation of about 5.30 and 7.00 ppm in CHCl₃ and DMSO, respectively. Moreover, another set of scaling factors for -NH₂ chemical shifts is proposed based on a separate database with three levels of theory. Encouragingly, the linear correlation is found to be reasonably transferable between these two solvents. This finding will enable broader applications of the developed empirical scaling factors to other commonly used solvents in NMR experiments. The consistency between theoretical predictions and experimental results for structural elucidations is illustrated for selected examples including regioisomers, tautomers, oxidation states, and protonated structures.
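The linear-scaling idea described above can be illustrated with a short sketch. It assumes the simplest form of the correlation, an ordinary least-squares fit of experimental shifts against calculated isotropic shielding constants, followed by an RMSD over the fitted set. The function names and the numerical values are hypothetical and are not taken from the paper.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmsd(predicted, observed):
    """Root-mean-square deviation between two equally long sequences."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return sqrt(squared / len(observed))


# Hypothetical values in ppm, for illustration only (not data from the paper).
sigma_calc = [150.0, 100.0, 50.0, 0.0, -50.0]       # calculated isotropic shieldings
delta_exp = [-250.0, -205.0, -148.0, -102.0, -50.0]  # experimental chemical shifts

# The fitted slope and intercept play the role of the empirical scaling factors.
slope, intercept = fit_line(sigma_calc, delta_exp)
delta_pred = [slope * s + intercept for s in sigma_calc]
error = rmsd(delta_pred, delta_exp)
```

Because shielding and chemical shift run in opposite directions, the fitted slope is expected to be negative and close to -1. Some protocols fit the regression in the other direction (shielding against shift) and invert it; which convention the authors use is not stated in the abstract.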
¹¹B nuclear magnetic resonance (NMR) spectroscopy is a useful tool for studies of boron-containing compounds in terms of structural analysis and reaction kinetics monitoring. A computational protocol, which is aimed at an accurate prediction of ¹¹B NMR chemical shifts via linear regression, was proposed based on the density functional theory and the gauge-including atomic orbital approach. Similar to the procedure used for carbon, hydrogen, and nitrogen chemical shift predictions, a database of boron-containing molecules was first compiled. Scaling factors for the linear regression between calculated isotropic shielding constants and experimental chemical shifts were then fitted using eight different levels of theory with both the solvation model based on density and conductor-like polarizable continuum model solvent models. The best method with the two solvent models yields a root-mean-square deviation of about 3.40 and 3.37 ppm, respectively. To explore the capabilities and potential limitations of the developed protocols, classical boron–hydrogen compounds and molecules with representative boron bonding environments were chosen as test cases, and the consistency between experimental values and theoretical predictions was demonstrated.
Background:
Thermophilic proteins can maintain good activity at high temperatures, so studying them is important for understanding the thermal stability of proteins.
Objective:
To address the low precision and low efficiency of thermophilic protein prediction, this paper proposes a prediction method based on feature fusion and machine learning.
Method:
For the selected thermophilic data sets, the thermophilic protein sequences were first characterized by feature fusion, combining g-gap dipeptide, entropy density, and autocorrelation coefficient features. Then, kernel principal component analysis (KPCA) was used to reduce the dimension of the resulting protein sequence features in order to reduce training time and improve efficiency. Finally, the classification model was built using a classification algorithm.
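As an illustration of the first feature named above, the following is a minimal sketch of a g-gap dipeptide composition, assuming the common definition: the normalized frequency of each of the 400 ordered residue pairs separated by g intervening residues. The function name and the handling of non-standard residues are assumptions; the abstract does not specify the authors' exact implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide_composition(sequence, g):
    """Return a 400-dimensional frequency vector for residue pairs with gap g.

    g = 0 gives the ordinary (adjacent) dipeptide composition. Pairs that
    contain a non-standard residue are skipped (an assumption of this sketch),
    so frequencies are normalized by the number of counted pairs.
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    offset = g + 1  # distance between the two residues of a pair
    total = 0
    for i in range(len(sequence) - offset):
        pair = sequence[i] + sequence[i + offset]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Example: adjacent pairs (g = 0) and pairs one residue apart (g = 1).
adjacent = g_gap_dipeptide_composition("ACAC", 0)
one_gap = g_gap_dipeptide_composition("ACAC", 1)
```

In a feature-fusion setting, vectors like these would be concatenated with the other descriptors before dimensionality reduction.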
Results:
A variety of classification algorithms were trained and tested on the selected thermophilic dataset. In this comparison, the accuracy of the support vector machine (SVM) under the jackknife method was over 92%. Other evaluation indicators also showed that the SVM performed best.
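The jackknife test mentioned above is leave-one-out validation: each sample is held out once, the model is trained on the rest, and the held-out sample is predicted. The sketch below shows the procedure only. The nearest-centroid classifier is a simple stand-in chosen to keep the example self-contained, not the SVM used in the paper, and the toy data and all names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, query):
    """Stand-in classifier: return the label whose class centroid is closest."""
    sums, counts = {}, {}
    for features, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        n = counts[label]
        return sum((s / n - q) ** 2 for s, q in zip(sums[label], query))

    return min(sums, key=squared_distance)


def jackknife_accuracy(x, y, predict):
    """Leave-one-out accuracy: hold out each sample once and predict it."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)


# Toy, well-separated feature vectors for illustration only.
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
labels = ["mesophilic"] * 3 + ["thermophilic"] * 3

accuracy = jackknife_accuracy(features, labels, nearest_centroid_predict)
```

Any classifier exposing the same `predict(train_x, train_y, query)` shape could be passed in instead of the stand-in.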
Conclusion:
By combining an effective feature representation method with a robust classifier, the proposed method is suitable for predicting thermophilic proteins and is superior to most reported methods.
Deep learning-based methods have been widely applied to predict various kinds of molecular properties in the pharmaceutical industry with increasing success. In this study, we propose two predictive models...