Improving the Performance of Protein Kinase Identification Via High Dimensional Protein-Protein Interactions and Substrate Structure Data

Xiaoyi Xu, Ao Li, Liang Zou, Yi Shen, Wenwen Fan, Minghui Wang
DOI: https://doi.org/10.1039/c3mb70462a
2014-01-01
Molecular BioSystems
Abstract: As a crucial post-translational modification, protein phosphorylation regulates almost all basic cellular processes. Recently, thousands of phosphorylation sites have been discovered by large-scale phospho-proteomics studies, but only about 20% of them have information regarding their catalytic kinases, which makes it challenging to correctly identify the protein kinases responsible for experimentally verified phosphorylation sites. In most existing identification tools, only the local sequence is used to construct predictive models, and information regarding protein-protein interaction (PPI) is adopted only for further filtering. However, the limited information utilized by these tools is not sufficient to identify the protein kinases responsible for phosphorylated proteins. In this work, a novel computational approach that fully incorporates PPI and substrate structure information is proposed to improve the performance of human protein kinase identification. To handle the high dimensionality of the PPI and structure data, a two-step feature selection algorithm that incorporates a support vector machine (SVM) is designed to detect information useful in discriminating the corresponding kinase of phosphorylation sites. Benchmark datasets for kinase identification are constructed using human protein phosphorylation data extracted from the latest Phospho.ELM database. With the selected PPI and structure features, the performance of kinase identification is significantly enhanced compared with that obtained using only sequence information. To further verify our method, we compared it with the state-of-the-art tools NetworKIN and iGPS at two stringency levels, with medium (>90.0%) and high (>99.0%) specificity. The results show that our method outperforms existing tools in identifying protein kinases.
Further evaluation demonstrates that our method also has superior performance on different hierarchical levels including kinase, subfamily, family and group.
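
The comparison at fixed stringency levels described in the abstract can be illustrated with a minimal sketch. It assumes each candidate kinase-site pair has a numeric prediction score and that a pair is predicted positive when its score is at or above a cutoff. The function names and the toy scores are hypothetical and are not taken from the paper; the sketch only shows how a cutoff could be chosen so that specificity exceeds a target (for example 90.0% or 99.0%) before sensitivity is read off at that cutoff.

```python
import math


def specificity(negative_scores, cutoff):
    """Fraction of true negatives scoring below the cutoff (correctly rejected)."""
    return sum(s < cutoff for s in negative_scores) / len(negative_scores)


def sensitivity(positive_scores, cutoff):
    """Fraction of true positives scoring at or above the cutoff (correctly called)."""
    return sum(s >= cutoff for s in positive_scores) / len(positive_scores)


def cutoff_for_specificity(negative_scores, target):
    """Lowest cutoff whose specificity strictly exceeds `target`.

    Candidate cutoffs are the observed negative scores plus +inf, which
    rejects every negative and therefore reaches a specificity of 1.0.
    """
    for cutoff in sorted(set(negative_scores)) + [math.inf]:
        if specificity(negative_scores, cutoff) > target:
            return cutoff
    raise ValueError("target specificity must be below 1.0")


if __name__ == "__main__":
    # Toy scores, purely illustrative (not data from the paper).
    negatives = list(range(1000))
    positives = [850, 920, 975, 995, 1200]
    for level, target in (("medium", 0.90), ("high", 0.99)):
        cutoff = cutoff_for_specificity(negatives, target)
        print(level, cutoff, sensitivity(positives, cutoff))
```

Fixing the specificity first and then comparing sensitivity puts different tools on an equal footing, since each is evaluated at the same false-positive budget rather than at its own default threshold.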