Reversible protein phosphorylation is one of the most important post-translational modifications and regulates a wide range of cellular processes. Identifying kinase-specific phosphorylation sites helps to clarify the mechanisms of phosphorylation and its regulation. Although a number of computational approaches have been developed, few studies consider the hierarchical structure of kinases, and most existing tools use only local sequence information to construct predictive models. In this work, we conduct a systematic, hierarchy-specific investigation of protein phosphorylation site prediction, in which protein kinases are clustered into a hierarchy with four levels: kinase, subfamily, family and group. To enhance phosphorylation site prediction at all hierarchical levels, functional information about proteins, including gene ontology (GO) and protein-protein interaction (PPI), is used in addition to primary sequence to construct prediction models based on random forest. Analysis of the selected GO and PPI features shows that functional information is critical in determining protein phosphorylation sites at every hierarchical level. Furthermore, prediction results on Phospho.ELM and an additional test dataset demonstrate that the proposed method remarkably outperforms existing phosphorylation prediction methods at all hierarchical levels. The proposed method is freely available at http://bioinformatics.ustc.edu.cn/phos_pred/.
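To make the four hierarchical levels concrete, the following minimal Python sketch groups annotated phosphorylation sites into positive sets at the group, family, subfamily and kinase levels, so that one model could be trained per label at each level. It is not the authors' implementation: the function names, the site identifiers and the small kinase annotation table are assumptions made for illustration only.

```python
from collections import defaultdict

# Illustrative annotations only (an assumption for this sketch):
# kinase -> (group, family, subfamily)
KINASE_HIERARCHY = {
    "CDK1": ("CMGC", "CDK", "CDC2"),
    "CDK2": ("CMGC", "CDK", "CDC2"),
    "MAPK1": ("CMGC", "MAPK", "ERK"),
    "MAPK3": ("CMGC", "MAPK", "ERK"),
}

# Levels ordered from the broadest to the most specific.
LEVELS = ("group", "family", "subfamily", "kinase")


def label_at_level(kinase, level):
    """Return the hierarchy label of `kinase` at the requested level."""
    group, family, subfamily = KINASE_HIERARCHY[kinase]
    path = (group, family, subfamily, kinase)
    depth = LEVELS.index(level) + 1
    return "/".join(path[:depth])


def positives_by_level(annotated_sites):
    """Group (site_id, kinase) pairs into positive sets for every level."""
    grouped = {level: defaultdict(list) for level in LEVELS}
    for site_id, kinase in annotated_sites:
        for level in LEVELS:
            grouped[level][label_at_level(kinase, level)].append(site_id)
    return grouped


# Hypothetical site identifiers, used only to exercise the functions.
sites = [
    ("siteA", "CDK1"),
    ("siteB", "CDK2"),
    ("siteC", "MAPK1"),
    ("siteD", "MAPK3"),
]
grouped = positives_by_level(sites)
for level in LEVELS:
    print(level, dict(grouped[level]))
```

In this sketch a site annotated with CDK2 counts as a positive example for the labels CMGC, CMGC/CDK, CMGC/CDK/CDC2 and CMGC/CDK/CDC2/CDK2, which is one simple way to obtain training sets at every level from kinase-level annotations.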
As a crucial post-translational modification, protein phosphorylation regulates almost all basic cellular processes. Large-scale phospho-proteomics studies have recently discovered thousands of phosphorylation sites, but only about 20% of them have information on the catalytic kinase, which makes it a great challenge to correctly identify the protein kinases responsible for experimentally verified phosphorylation sites. Most existing identification tools use only local sequence information to construct predictive models and adopt protein-protein interaction (PPI) information only for further filtering. However, the limited information used by these tools is not sufficient to identify the protein kinases responsible for phosphorylated proteins. In this work, a novel computational approach that fully incorporates PPI and substrate structure information is proposed to improve the performance of human protein kinase identification. To handle the high dimensionality of the PPI and structure data, a two-step feature selection algorithm that incorporates a support vector machine (SVM) is designed to detect the information useful for discriminating the kinase corresponding to each phosphorylation site. Benchmark datasets for kinase identification are constructed using human protein phosphorylation data extracted from the latest Phospho.ELM database. With the selected PPI and structure features, the performance of kinase identification is significantly enhanced compared with that obtained using sequence information alone. To further verify our method, we compared it with the state-of-the-art tools NetworKIN and IGPS at two stringency levels, medium (>90.0% specificity) and high (>99.0% specificity). The results show that our method outperforms existing tools in identifying protein kinases. Further evaluation demonstrates that our method also has superior performance at different hierarchical levels, including kinase, subfamily, family and group.
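The two stringency levels mentioned above can be illustrated with a minimal Python sketch that picks a score cut-off meeting a target specificity on negative examples and then reports the sensitivity on positive examples at that cut-off. This is only one plausible way to define such cut-offs, not the procedure used by the authors, NetworKIN or IGPS; the function names and all scores are hypothetical.

```python
def threshold_for_specificity(negative_scores, target_specificity):
    """Lowest cut-off whose specificity strictly exceeds the target.

    A site is predicted positive when its score is greater than the cut-off,
    so specificity is the fraction of negatives scoring at or below it.
    """
    n = len(negative_scores)
    for threshold in sorted(set(negative_scores)):
        true_negatives = sum(score <= threshold for score in negative_scores)
        if true_negatives / n > target_specificity:
            return threshold
    raise ValueError("no cut-off reaches the requested specificity")


def sensitivity(positive_scores, threshold):
    """Fraction of positive examples scoring above the cut-off."""
    hits = sum(score > threshold for score in positive_scores)
    return hits / len(positive_scores)


# Hypothetical prediction scores, used only to exercise the functions.
negatives = [i / 20 for i in range(20)]   # 0.00, 0.05, ..., 0.95
positives = [0.2, 0.92, 0.97, 0.99]

medium_cutoff = threshold_for_specificity(negatives, 0.90)
high_cutoff = threshold_for_specificity(negatives, 0.99)

print("medium:", medium_cutoff, sensitivity(positives, medium_cutoff))
print("high:", high_cutoff, sensitivity(positives, high_cutoff))
```

Fixing the specificity first and comparing sensitivity afterwards puts different predictors on the same false-positive budget, which is why comparisons of this kind are usually reported at a small number of fixed stringency levels.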