This study explores drug solubility in lipid excipients, an area that
remains complex despite recent advances in understanding and predicting
solubility from molecular structure.
To this end, novel descriptor sets were investigated using machine
learning techniques to understand the determinants governing
interactions between solutes and medium-chain triglycerides (MCTs).
Quantitative structure-property relationship (QSPR) models were constructed
on an extended solubility data set comprising 182 experimental values
for structurally diverse drug molecules, including both development
and marketed drugs. Four
classes of molecular descriptors, ranging from traditional representations
to complex geometrical descriptions, were assessed and compared in
terms of their predictive accuracy and interpretability. These include
two-dimensional (2D) and three-dimensional (3D) descriptors, Abraham
solvation parameters, extended connectivity fingerprints (ECFPs),
and the smooth overlap of atomic positions (SOAP) descriptor. After
testing three distinct regularized regression algorithms alongside
various preprocessing schemes, the SOAP descriptor was found to yield
the best-performing model in terms of interpretability and accuracy.
Its atom-centered characteristics allowed contributions to be estimated
at the atomic level, thereby enabling the ranking of prevalent molecular
motifs and their influence on drug solubility in MCTs. On a separate
test set, models based on 2D and 3D, SOAP, and Abraham solvation
descriptors demonstrated high predictive accuracy (RMSE = 0.50), whereas
the model trained on ECFP4 descriptors showed inferior predictive
accuracy. Lastly, uncertainty estimates for each model were introduced
to assess their applicability domains and provide information on where
the models may extrapolate in chemical space and, thus, where more
data may be necessary to refine a data-driven approach to predict
solubility in MCTs. Overall, this work further enables
computationally informed formulation development by introducing a
novel in silico approach for rational drug development and prediction
of dose loading in lipids.
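
The general workflow summarized above (descriptor preprocessing, a regularized regression fit, and RMSE on a separate test set) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: ridge regression stands in for the three regularized algorithms, which are not named here; the descriptor matrix and solubility values are synthetic random numbers; and all function names (`standardize`, `fit_ridge`, `rmse`) and the penalty `alpha=1.0` are hypothetical choices.

```python
import math
import random


def standardize(train, test):
    """Scale columns using the mean and standard deviation of the training rows only."""
    n_cols = len(train[0])
    means = [sum(row[j] for row in train) / len(train) for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((row[j] - means[j]) ** 2 for row in train) / len(train)
        stds.append(math.sqrt(var) or 1.0)  # guard against constant columns

    def scale(rows):
        return [[(row[j] - means[j]) / stds[j] for j in range(n_cols)] for row in rows]

    return scale(train), scale(test)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_ridge(X, y, alpha):
    """Ridge fit on column-centered X: w = (X^T X + alpha I)^-1 X^T (y - mean(y))."""
    p = len(X[0])
    y_mean = sum(y) / len(y)  # unpenalized intercept, valid because X is centered
    A = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * (yi - y_mean) for row, yi in zip(X, y)) for i in range(p)]
    return solve(A, b), y_mean


def predict(X, weights, intercept):
    return [intercept + sum(w * x for w, x in zip(weights, row)) for row in X]


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Synthetic stand-in data (NOT the experimental solubility set of this study).
rng = random.Random(0)
true_w = [0.8, -0.5, 0.3, 0.0, 0.1]
X = [[rng.gauss(0, 1) for _ in true_w] for _ in range(100)]
y = [1.0 + sum(w * x for w, x in zip(true_w, row)) + rng.gauss(0, 0.1) for row in X]

# Hold out a separate test set, then preprocess using training statistics only.
X_train, X_test, y_train, y_test = X[:80], X[80:], y[:80], y[80:]
X_train_s, X_test_s = standardize(X_train, X_test)

weights, intercept = fit_ridge(X_train_s, y_train, alpha=1.0)
test_rmse = rmse(y_test, predict(X_test_s, weights, intercept))
print(f"held-out RMSE: {test_rmse:.3f}")
```

Fitting the scaling parameters on the training rows alone keeps information from the test set out of the model, so the reported RMSE reflects performance on unseen compounds. In practice the penalty strength would be chosen by cross-validation rather than fixed.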