Machine learning methods for pKa prediction of small molecules: Advances and challenges

Wu, Jialu; Kang, Yu; Pan, Peichen; Hou, Tingjun

doi:10.1016/j.drudis.2022.103372

Cited by 25 publications

(27 citation statements)

References 59 publications


“…where δ_i is 1 for acids and −1 for bases. However, accurate prediction of pKa is still quite difficult due to the scarcity of data and inherent complexity of the property [17], making direct application of this formula a great challenge [18]. In recent years, with the development of machine learning (ML) algorithms, many handcrafted descriptor-based ML models have been developed for log D_7.4 prediction, such as random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost).…”

Section: Introduction (mentioning)

confidence: 99%

“…Experimental approaches such as the shake flask method, filter probe method, “slow stirring” method, chromatography method, and potentiometric titration method are costly and time-consuming. On the other hand, log D_7.4 can be derived from log P and pKa as follows: $\log D(\mathrm{pH}) = \log P - \log\left(1 + 10^{(\mathrm{pH} - \mathrm{p}K_\mathrm{a})\,\delta_i}\right)$, where $\delta_i$ is 1 for acids and −1 for bases. However, accurate prediction of pKa is still quite difficult due to the scarcity of data and inherent complexity of the property, making direct application of this formula a great challenge …”

Section: Introduction (mentioning)

confidence: 99%
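
The relation quoted above between log D, log P, and pKa can be sketched in a few lines of Python. This is only an illustration for a single ionizable group: the function name `log_d`, its parameters, and the example values are assumptions made here and do not come from the cited papers.

```python
import math


def log_d(log_p: float, pka: float, ph: float = 7.4, acid: bool = True) -> float:
    """Estimate log D at a given pH from log P and pKa (single ionizable group).

    Implements log D(pH) = log P - log10(1 + 10**((pH - pKa) * delta)),
    with delta = +1 for acids and -1 for bases, as in the quoted statement.
    """
    delta = 1.0 if acid else -1.0
    return log_p - math.log10(1.0 + 10.0 ** ((ph - pka) * delta))


if __name__ == "__main__":
    # Hypothetical compounds, chosen only to show the direction of the effect.
    print("acid, log P = 3.0, pKa = 4.4:", log_d(3.0, 4.4))
    print("base, log P = 3.0, pKa = 9.4:", log_d(3.0, 9.4, acid=False))
```

For an acid whose pKa lies three units below the pH, the ionized form dominates and log D falls about three units below log P. Any error in the predicted pKa enters the exponent directly, which is why the quoted authors describe direct use of the formula as challenging.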


Improved GNNs for Log D_7.4 Prediction by Transferring Knowledge from Low-Fidelity Data

Duan, Zhang, et al. 2023. J. Chem. Inf. Model.

The n-octanol/buffer solution distribution coefficient at pH = 7.4 (log D_7.4) is an indicator of lipophilicity, and it influences a wide variety of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties and druggability of compounds. In log D_7.4 prediction, graph neural networks (GNNs) can uncover subtle structure–property relationships (SPRs) by automatically extracting features from molecular graphs that facilitate the learning of SPRs, but their performances are often limited by the small size of available datasets. Herein, we present a transfer learning strategy called pretraining on computational data and then fine-tuning on experimental data (PCFE) to fully exploit the predictive potential of GNNs. PCFE works by pretraining a GNN model on 1.71 million computational log D data (low-fidelity data) and then fine-tuning it on 19,155 experimental log D_7.4 data (high-fidelity data). The experiments for three GNN architectures (graph convolutional network (GCN), graph attention network (GAT), and Attentive FP) demonstrated the effectiveness of PCFE in improving GNNs for log D_7.4 predictions. Moreover, the optimal PCFE-trained GNN model (cx-Attentive FP, $R^2_{\mathrm{test}}$ = 0.909) outperformed four excellent descriptor-based models (random forest (RF), gradient boosting (GB), support vector machine (SVM), and extreme gradient boosting (XGBoost)). The robustness of the cx-Attentive FP model was also confirmed by evaluating the models with different training data sizes and dataset splitting strategies. Therefore, we developed a webserver and defined the applicability domain for this model. The webserver () provides free log D_7.4 prediction services. In addition, the important descriptors for log D_7.4 were detected by the Shapley additive explanations (SHAP) method, and the most relevant substructures of log D_7.4 were identified by the attention mechanism. Finally, the matched molecular pair analysis (MMPA) was performed to summarize the contributions of common chemical substituents to log D_7.4, including a variety of hydrocarbon groups, halogen groups, heteroatoms, and polar groups. In conclusion, we believe that the cx-Attentive FP model can serve as a reliable tool to predict log D_7.4 and hope that pretraining on low-fidelity data can help GNNs make accurate predictions of other endpoints in drug discovery.


“…It prevents time- and cost-consuming experiments, suggesting easily screenable sets of data. Computational calculation of acid–base dissociation constants (pKa) is among the most embraced methods [8, 9].…”

Section: Introduction (mentioning)

confidence: 99%

An Accurate Approach for Computational pKa Determination of Phenolic Compounds

et al. 2022


Computational chemistry is a valuable tool, as it allows for in silico prediction of key parameters of novel compounds, such as pKa. In the framework of computational pKa determination, the literature offers several approaches based on different levels of theory, functionals, and continuum solvation models. However, correction factors are often used to provide reliable models that adequately predict pKa. In this work, an accurate protocol based on a direct approach is proposed for computing the pKa of phenols. Importantly, this methodology does not require the use of correction factors or mathematical fitting, making it highly practical, easy to use, and fast. Above all, DFT calculations performed in the presence of two explicit water molecules using the CAM-B3LYP functional with the 6-311G+dp basis set and a solvation model based on density (SMD) led to accurate pKa values. In particular, calculations performed on a series of 13 differently substituted phenols provided reliable results, with a mean absolute error of 0.3. Furthermore, the model achieves accurate results with -CN and -NO2 substituents, which are usually excluded from computational pKa studies, enabling easy and reliable pKa determination in a wide range of phenols.


“…QSAR based on machine learning can exploit different approaches ranging from the descriptor models (i.e., atomic descriptor, rooted fingerprints, and hybrid features) to the graph-based models, organized in kernels and neural networks. [3] The former range from simple linear regression to complex neural networks, while the latter still require a remarkable number of steps in the identification of compound resides, the overall optimization of structures, and algorithms. [3] Among physics-based models, the ab initio bond length high correlation subsets (AIBLHiCoS or AIBL) method gave promising results for the pKa prediction of a wide range of complex organic molecules.…”

Section: Introduction (mentioning)

confidence: 99%

Easy to Use DFT Approach for Computational pKa Determination of Carboxylic Acids

Pezzola, Venanzi, Galloni, et al. 2023. Chemistry A European J

In pKa computational determination, the challenge in exploring and fostering new methodologies and approaches goes in parallel with the amelioration of computational performances. In this paper a “ready to use methodology” has been compared to other strategies, such as the re‐shaping in solvation cavity (Bondi radius re‐shaping), wanting to assess its reliability in predicting the pKa of a broad list of carboxylic acids. Thus, the functionals B3LYP and CAM‐B3LYP have been selected, using SMD as continuum solvation model. Exploiting our previous results, two water molecules were made explicit on the reaction centre. Data show that our model (CAM‐B3LYP/2H2O) is capable of accurately predicting pKa, leading to mean absolute error (MAE) values lower than 0.5. Notably, good results were achieved in computing the pKa of substituents bearing nitro and cyano groups. Focusing on B3LYP, eventually remarkable outputs were obtained only when Bondi correction was applied to the complex with two water molecules. Hence, massive outcomes were obtained in foreseeing the trichloro and trifluoro acetic acid pKa. These findings demonstrated that no complex level of theory nor external factor is required to accurately predict carboxylic acids pKa, with MAE well below 0.5 units.
