Solubility is a key metric for therapeutic compounds, and
insoluble compounds compromise the accuracy of assays at all stages of
chemical biology and drug discovery. Herein, we disclose naïve
Bayesian classifier models to predict aqueous solubility. Publicly
accessible aqueous solubility data were used to create two full, or
nonpruned, training sets. These two sets were also combined to create
a full fused set, and a training set composed of a literature collation
of solubility data was also considered as a reference. We tested different
extents of data pruning on the training sets and constructed machine
learning models that were evaluated with two independent, external
test sets containing compounds absent from the training sets. The best
pruned and fused model predicted these external test sets significantly
more accurately than either the full model or the full fused model.
Carefully removing data from the training set can therefore yield more
accurate machine learning models for aqueous solubility from less
information. This
knowledge and the curated training sets should prove useful to future
machine learning approaches.
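
The prune-then-train workflow summarized above can be illustrated with a minimal sketch. It is not the implementation used in this work: the pruning rule (dropping compounds whose log solubility lies within a margin of the soluble/insoluble cutoff), the cutoff and margin values, the four-bit fingerprints, and the function names `prune`, `fit`, and `predict` are all assumptions made for illustration, and the classifier is a plain Laplace-smoothed Bernoulli naïve Bayes written from scratch.

```python
import math

# Each record is (set of "on" fingerprint bit indices, log solubility).
# All values below are hypothetical toy data, not data from this work.
N_BITS = 4
CUTOFF = -4.0   # assumed class boundary: log S >= CUTOFF counts as soluble
MARGIN = 0.5    # assumed pruning margin around the boundary

train = [
    ({0, 1}, -1.0), ({0}, -2.0), ({0, 2}, -1.5),
    ({3}, -6.0), ({2, 3}, -5.5), ({1, 3}, -7.0),
    ({0, 3}, -4.1), ({3}, -3.9),  # borderline compounds near the cutoff
]

def prune(records, cutoff, margin):
    """Keep only records whose log S is more than `margin` from `cutoff`."""
    return [r for r in records if abs(r[1] - cutoff) > margin]

def fit(records, cutoff, n_bits):
    """Count, per class, how many compounds there are and how often each bit is on."""
    counts = {True: [0] * n_bits, False: [0] * n_bits}
    totals = {True: 0, False: 0}
    for bits, log_s in records:
        label = log_s >= cutoff
        totals[label] += 1
        for i in bits:
            counts[label][i] += 1
    return counts, totals

def predict(model, bits, n_bits):
    """Return True (soluble) if the soluble class has the higher log posterior."""
    counts, totals = model
    n = totals[True] + totals[False]
    scores = {}
    for label in (True, False):
        # Laplace-smoothed class prior and per-bit Bernoulli likelihoods.
        score = math.log((totals[label] + 1) / (n + 2))
        for i in range(n_bits):
            p_on = (counts[label][i] + 1) / (totals[label] + 2)
            score += math.log(p_on if i in bits else 1.0 - p_on)
        scores[label] = score
    return scores[True] > scores[False]

pruned = prune(train, CUTOFF, MARGIN)
model = fit(pruned, CUTOFF, N_BITS)
print(len(train), "training compounds before pruning,", len(pruned), "after")
print("bit 0 only predicted soluble:", predict(model, {0}, N_BITS))
print("bit 3 only predicted soluble:", predict(model, {3}, N_BITS))
```

In this toy setting the two borderline compounds are removed before training, so the classifier learns only from unambiguous examples. Whether such pruning helps on real data has to be checked against external test sets, as described above.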