The nuclear receptor (NR) superfamily includes phylogenetically
related ligand-activated proteins, which play a key role in various
cellular activities. NR proteins are subdivided into seven subfamilies
based on their function, mechanism, and nature of the interacting
ligand. Developing robust tools to identify NRs could give insights
into their functional relationships and involvement in disease pathways.
Existing NR prediction tools use only a few types of sequence-based
features and are tested on relatively similar independent datasets;
thus, they may suffer from overfitting when extended to new genera
of sequences. To address this problem, we developed the Nuclear Receptor
Prediction Tool (NRPreTo), a two-level NR prediction tool with a unique
training approach: in addition to the sequence-based features used by
existing NR prediction tools, it uses six additional feature groups
depicting various physicochemical, structural, and evolutionary features
of proteins. The first level of NRPreTo predicts whether a query protein
is an NR or a non-NR, and the second level subclassifies a predicted NR
into one of the seven NR subfamilies. We developed Random Forest
classifiers and tested them on benchmark datasets, as well as on the
entire human protein datasets from RefSeq and the Human Protein Reference
Database (HPRD). We observed that
using the additional feature groups improved prediction performance. We also
observed that NRPreTo achieved high performance on the external datasets
and predicted 59 novel NRs in the human proteome. The source code
of NRPreTo is publicly available at .
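
The two-level scheme described above can be illustrated with a minimal
sketch. This is not the NRPreTo source code: it assumes scikit-learn's
RandomForestClassifier, and the class name TwoLevelClassifier, the helper
make_toy_data, the labels "non-NR" and "subfamily_1" to "subfamily_7",
and the synthetic feature vectors are all hypothetical stand-ins for
real protein features and subfamily names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


class TwoLevelClassifier:
    """Hypothetical two-level cascade (not the NRPreTo implementation).

    Level 1 separates NR from non-NR; level 2 assigns a subfamily,
    and is applied only to proteins that level 1 predicts as NR.
    """

    def __init__(self, n_estimators=100, random_state=0):
        self.level1 = RandomForestClassifier(
            n_estimators=n_estimators, random_state=random_state)
        self.level2 = RandomForestClassifier(
            n_estimators=n_estimators, random_state=random_state)

    def fit(self, X, is_nr, subfamily):
        X = np.asarray(X, dtype=float)
        is_nr = np.asarray(is_nr, dtype=bool)
        subfamily = np.asarray(subfamily)
        # Level 1 is trained on all proteins (1 = NR, 0 = non-NR).
        self.level1.fit(X, is_nr.astype(int))
        # Level 2 is trained on known NRs only.
        self.level2.fit(X[is_nr], subfamily[is_nr])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        nr_mask = self.level1.predict(X) == 1
        labels = np.full(len(X), "non-NR", dtype=object)
        if nr_mask.any():
            labels[nr_mask] = self.level2.predict(X[nr_mask])
        return labels.astype(str)


def make_toy_data(rng, n_per_class=30, n_features=8):
    """Synthetic, well-separated clusters standing in for real features."""
    names = ["non-NR"] + [f"subfamily_{i}" for i in range(1, 8)]
    X, y = [], []
    for k, name in enumerate(names):
        center = np.zeros(n_features)
        center[k] = 10.0  # each of the 8 classes is shifted along its own axis
        X.append(center + rng.normal(size=(n_per_class, n_features)))
        y += [name] * n_per_class
    return np.vstack(X), np.array(y)


rng = np.random.default_rng(0)
X_train, y_train = make_toy_data(rng)
X_test, y_test = make_toy_data(rng, n_per_class=10)

model = TwoLevelClassifier().fit(X_train, y_train != "non-NR", y_train)
pred = model.predict(X_test)
accuracy = float(np.mean(pred == y_test))
print(f"toy accuracy: {accuracy:.2f}")
```

One consequence of this cascade design is that a protein rejected at the
first level never reaches the subfamily classifier, so first-level errors
cap the overall accuracy; the toy data here are far easier to separate
than real protein features.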