K-Nearest Neighbor Classifier Ensemble for Prediction of Phosphorylation Sites.
Zhiwen Yu, Zhongkai Deng, Hau-San Wong
2008-01-01
Abstract: Researchers have recently paid increasing attention to the prediction of phosphorylation sites because of the important role phosphorylation plays in many biological processes, such as metabolism, growth, and membrane transport. Although many approaches exist for predicting phosphorylation sites, few of them consider an ensemble approach. In this paper, we first propose a new classifier ensemble framework called K-Nearest Neighbor Classifier Ensemble (KNNCE), which incorporates the bagging technique and the K-nearest neighbor classifier into an ensemble framework for the prediction of phosphorylation sites. We then apply KNNCE to six kinase families: CK1, GRK, GSK, INSR, PKB, and SRC. The experiments illustrate that (1) KNNCE achieves good results in these families, and (2) the accuracies of the prediction system for these families are 69.25%, 69%, 71.91%, 86.65%, 88.83% and 95.22%, respectively.
I. INTRODUCTION
Protein phosphorylation, one of the most important post-translational modifications in both prokaryotic and eukaryotic cells, is involved in the regulation of many cellular pathways [1][2], including metabolism, growth, differentiation and membrane transport. Kinases, also known as phosphotransferases, constitute a large protein superfamily and act as the enzymes in protein phosphorylation. In eukaryotic organisms, the most common form of phosphorylation is the introduction of a phosphate group into a particular serine, threonine or tyrosine residue (the phosphorylation site) of the substrate, catalyzed by a specific kinase.
Phosphorylation sites and the relevant kinases can be identified in vivo and in vitro. Such methods include mass spectrometry (MS) techniques (Aebersold et al., 2003 [3]), peptide microarrays (Rychlewski et al., 2004 [4]), and phosphospecific proteolysis (Knight et al., 2003 [5]). Phospho.ELM (Diella et al., 2004 [6]) is a database of such experimentally verified phosphorylation sites in eukaryotic proteins. However, such methods are usually expensive and time-consuming.
With the rapidly growing number of published protein sequences, computational approaches that predict phosphorylation sites more conveniently and efficiently are highly desirable and have developed quickly.
NetPhos (Blom et al., 1999 [7]) is an early prediction system of this kind, based on a standard feed-forward artificial neural network; Blom et al., 2004 [8] extended it to NetPhosK, a kinase-specific prediction system. Scansite (latest version 2.0) is a search tool for motifs that are likely to be phosphorylated by specific kinases (Yaffe et al., 2001 [9]). It is based on a matrix of the selectivity values of residues at each position relative to experimentally identified phosphorylation sites. Kim et al., 2004 [10] designed PredPhospho, also a kinase-specific prediction system, with SVM (support vector machine) as the core algorithm. Xue et al., 2005 [11] proposed a group-based phosphorylation predicting and scoring (GPS) method, which calculates the similarity of motifs based on the BLOSUM62 matrix. Xue et al., 2006 [12] also applied Bayesian decision theory in an approach called PPSP (Prediction of PK-specific Phosphorylation site). Huang et al., 2005 [13] adopted the hidden Markov model in their web server KinasePhos 1.0 for computationally identifying catalytic kinase-specific phosphorylation sites.
Wong et al., 2007 [14] extended KinasePhos 1.0 to KinasePhos 2.0, which integrates SVM, protein sequence profile and protein coupling pattern.
Although a number of approaches exist for the prediction of phosphorylation sites, none of them considers the classifier ensemble approach, which combines multiple classifiers to obtain more robust, stable and accurate results. In this paper, we propose a new classifier ensemble approach called K-Nearest Neighbor Classifier Ensemble (KNNCE), which incorporates the bagging technique and the K-nearest neighbor classifier into the ensemble framework to predict eukaryotic protein phosphorylation sites and to improve the accuracy, stability and robustness of the final predicted result.
II. K-N