DNA-binding proteins perform an indispensable function in the maintenance and processing of genetic information, but because they are so numerous, traditional experimental methods identify them inefficiently. In contrast, machine learning methods offer satisfactory speed and accuracy for this task. This work extracts four feature types from primary and secondary sequence features: Reduced sequence and index-vectors (RS), Pseudo-amino acid components (PseAACS), Position-specific scoring matrix-Auto Cross Covariance Transform (PSSM-ACCT), and Position-specific scoring matrix-Discrete Wavelet Transform (PSSM-DWT). Using LASSO for dimension reduction, we experiment with combinations of the feature submodels to determine the optimal number of top-ranked features. The selected features are then fed separately into three classifiers, Ensemble subspace discriminant, Ensemble bagged tree, and KNN, to predict DNA-binding proteins. Three datasets, PDB594, PDB1075, and PDB186, are used to evaluate the proposed approach: PDB1075 and PDB594 for five-fold cross-validation, and PDB186 for the independent test. In five-fold cross-validation, the Ensemble subspace discriminant reaches accuracies of 86.98% on PDB1075 and 88.9% on PDB594. On the independent test, multi-classifier voting achieves an accuracy of 83.33%, which suggests that the proposed methodology can predict DNA-binding proteins effectively.
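The evaluation protocol described above (five-fold cross-validation plus multi-classifier voting) can be illustrated with a minimal, standard-library-only sketch. The abstract does not state the voting rule, so a simple majority vote over the three classifiers is assumed here. The function names (`k_fold_indices`, `majority_vote`, `accuracy`) and the toy prediction lists are hypothetical, chosen for illustration only. They are not the authors' implementation and not real results.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Partition range(n_samples) into k contiguous (train, test) index folds.

    In practice the samples would usually be shuffled (or stratified by
    class) before splitting; that step is omitted to keep the sketch short.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, test))
        start = stop
    return folds


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-classifier label lists by simple majority (assumed rule)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of positions where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (1 = DNA-binding, 0 = non-binding); made up for illustration.
y_true = [1, 0, 1, 1, 0, 0]
pred_subspace = [1, 0, 0, 1, 0, 1]  # stand-in for Ensemble subspace discriminant
pred_bagged = [1, 1, 1, 1, 0, 0]    # stand-in for Ensemble bagged tree
pred_knn = [0, 0, 1, 0, 0, 0]       # stand-in for KNN

voted = majority_vote([pred_subspace, pred_bagged, pred_knn])
folds = k_fold_indices(1075, k=5)   # fold sizes for a PDB1075-sized dataset
```

With three binary classifiers a tie cannot occur, so the majority is always well defined. In this toy case each classifier makes errors at different positions, and the vote corrects them, which is the usual motivation for combining classifiers.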