The n-octanol/buffer solution distribution coefficient
at pH = 7.4 (log D7.4) is an indicator
of lipophilicity, and it influences a wide variety of absorption,
distribution, metabolism, excretion, and toxicity (ADMET) properties
and druggability of compounds. In log D7.4 prediction,
graph neural networks (GNNs) can uncover subtle
structure–property relationships (SPRs) by automatically extracting
features from molecular graphs that facilitate the learning of SPRs,
but their performance is often limited by the small size of available
datasets. Herein, we present a transfer learning strategy called pretraining
on computational data and then fine-tuning on experimental data (PCFE)
to fully exploit the predictive potential of GNNs. PCFE works by pretraining
a GNN model on 1.71 million computational log D values (low-fidelity data) and then fine-tuning it on 19,155 experimental
log D7.4 values (high-fidelity data).
The experiments for three GNN architectures (graph convolutional network
(GCN), graph attention network (GAT), and Attentive FP) demonstrated
the effectiveness of PCFE in improving GNNs for log D7.4
prediction. Moreover, the optimal PCFE-trained
GNN model (cx-Attentive FP, test-set R² = 0.909) outperformed four excellent descriptor-based models
(random forest (RF), gradient boosting (GB), support vector machine
(SVM), and extreme gradient boosting (XGBoost)). The robustness of
the cx-Attentive FP model was also confirmed by evaluating the models
with different training data sizes and dataset splitting strategies.
Therefore, we developed a webserver and defined the applicability
domain for this model. The webserver provides free log D7.4 prediction
services. In addition, the important descriptors for log D7.4
were detected by the Shapley additive explanations
(SHAP) method, and the most relevant substructures for log D7.4
were identified by the attention mechanism.
Finally, matched molecular pair analysis (MMPA) was performed
to summarize the contributions of common chemical substituents to
log D7.4, including a variety of
hydrocarbon groups, halogen groups, heteroatoms, and polar groups.
In conclusion, we believe that the cx-Attentive FP model can serve
as a reliable tool to predict log D7.4
and hope that pretraining on low-fidelity data can help GNNs make
accurate predictions of other endpoints in drug discovery.
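
The pretrain-then-fine-tune workflow described above can be illustrated with a minimal sketch. It is not the authors' implementation: a small linear model stands in for the GNN, both datasets are synthetic, and every name and constant (make_data, train, r2, TRUE_W, APPROX_W, the sample sizes, learning rates, and epoch counts) is a hypothetical choice made for illustration. The "computational" rule is assumed to be a slightly biased version of the "experimental" rule, mimicking the relation between low-fidelity and high-fidelity data.

```python
import random


def make_data(n, weights, bias, noise, rng):
    """Draw n synthetic (features, target) pairs from a noisy linear rule."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in weights]
        y = sum(w * xi for w, xi in zip(weights, x)) + bias + rng.gauss(0.0, noise)
        data.append((x, y))
    return data


def train(data, params, lr, epochs):
    """Full-batch gradient descent on mean squared error, starting from params."""
    w, b = list(params[0]), params[1]
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j, xj in enumerate(x):
                grad_w[j] += 2.0 * err * xj / n
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def r2(data, params):
    """Coefficient of determination of the model on the given data."""
    w, b = params
    ys = [y for _, y in data]
    mean_y = sum(ys) / len(ys)
    ss_res = sum(
        (y - (sum(wi * xi for wi, xi in zip(w, x)) + b)) ** 2 for x, y in data
    )
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)

# Assumed toy rules: the "experimental" rule is the ground truth, and the
# "computational" rule is a cheap, slightly biased approximation of it.
TRUE_W, TRUE_B = [1.5, -2.0, 0.7, 0.3], 0.5
APPROX_W, APPROX_B = [1.3, -1.8, 0.9, 0.1], 0.2

low_fidelity = make_data(1000, APPROX_W, APPROX_B, 0.05, rng)  # large, approximate
high_fidelity = make_data(30, TRUE_W, TRUE_B, 0.1, rng)        # small, accurate
test_set = make_data(200, TRUE_W, TRUE_B, 0.1, rng)

zero_init = ([0.0] * len(TRUE_W), 0.0)

# Step 1: pretrain on the large low-fidelity set.
pretrained = train(low_fidelity, zero_init, lr=0.1, epochs=150)
# Step 2: fine-tune the pretrained parameters on the small high-fidelity set.
pcfe_model = train(high_fidelity, pretrained, lr=0.05, epochs=20)
# Baseline: same budget on the high-fidelity set, but starting from zero.
scratch_model = train(high_fidelity, zero_init, lr=0.05, epochs=20)

r2_pcfe = r2(test_set, pcfe_model)
r2_scratch = r2(test_set, scratch_model)
print(f"pretrained + fine-tuned R2: {r2_pcfe:.3f}")
print(f"trained from scratch R2:    {r2_scratch:.3f}")
```

In this convex toy problem the fine-tuned model benefits only from a better starting point under a short training budget, so the comparison illustrates the mechanics of the two-stage procedure and says nothing about the GNN results reported in the abstract.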