The development of toxicity classification models using the ToxCast
database has been extensively studied. Machine learning approaches
are effective in identifying the bioactivity of untested chemicals.
However, ToxCast assays differ in the amount of data and degree of
class imbalance (CI). Therefore, the resampling algorithm employed
should vary depending on the data distribution to achieve optimal
classification performance. In this study, the effects of CI and data
scarcity (DS) on the performance of binary classification models were
investigated using ToxCast bioassay data. An assay matrix based on
CI and DS was prepared for 335 assays with biologically intended target
information, and 28 CI assays and 3 DS assays were selected. Thirty
models, established by combining five molecular fingerprints (i.e.,
Morgan, MACCS, RDKit, Pattern, and Layered) with six algorithms [i.e.,
gradient boosting tree, random forest (RF), multilayer perceptron,
k-nearest neighbor, logistic regression, and naive Bayes],
were trained using the selected assay data set. Of the 30 trained
models, MACCS–RF showed the best performance and thus was selected
for analyses of the effects of CI and DS. Results showed that recall
and F1 were significantly lower when training with the CI assays than
with the DS assays. In addition, hyperparameter tuning of the RF algorithm
significantly improved F1 on CI assays. This study provided a basis
for developing a toxicity classification model with improved performance
by evaluating the effects of data set characteristics. It
also emphasized the importance of using appropriate evaluation metrics
and tuning hyperparameters in model development.
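
The point about evaluation metrics can be illustrated with a minimal sketch in plain Python. It does not reproduce the study's models or data. The labels below are hypothetical (an assumed assay with 5 active chemicals among 100), and the helper `binary_metrics` is introduced here only for illustration. The sketch shows why accuracy alone can look acceptable on a class-imbalanced assay while recall and F1 reveal that the active class is being missed.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for 0/1 labels (1 = active)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced assay: 5 actives followed by 95 inactives.
y_true = [1] * 5 + [0] * 95

# A degenerate model that labels every chemical inactive.
pred_all_inactive = [0] * 100

# A model that recovers 3 of the 5 actives and raises 2 false alarms.
pred_better = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

m_trivial = binary_metrics(y_true, pred_all_inactive)
m_better = binary_metrics(y_true, pred_better)

for name, m in [("all-inactive", m_trivial), ("better", m_better)]:
    print(name, {k: round(v, 2) for k, v in m.items()})
```

In this toy case the two models differ by only one percentage point in accuracy, whereas recall and F1 separate them clearly, which is consistent with the abstract's emphasis on choosing metrics that reflect the minority (active) class.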