Abstract: Phosphorylation, one of the most important post-translational modifications, plays a key role in many cellular physiological processes and in the onset of disease. In recent years, computational methods have increasingly been applied to the prediction of protein phosphorylation sites. However, most existing methods rely on simple protein sequence features that provide limited contextual information. To overcome this limitation, we propose DeepMPSF, a phosphorylation site prediction model based on multiple protein sequence features. It uses two types of features: sequence semantic features, which comprise residue type information and relative position information within the protein sequence, and protein background biophysical features, which carry global semantic information, and thus more comprehensive protein background information, obtained from pretrained models. To extract these features, DeepMPSF employs two separate subnetworks, the SSFE module and the BBFE module, which automatically extract high-level semantic features. The model also incorporates an ensemble learning strategy for handling imbalanced datasets during training and prediction. DeepMPSF is trained and evaluated on a well-established dataset of human proteins. Comparison with other benchmark methods shows that DeepMPSF outperforms them in predicting both S/T residues and Y residues. In particular, DeepMPSF generalizes well in cross-species blind tests, with average improvements of 5.63%/5.72%, 22.28%/25.94%, 20.11%/17.49%, and 26.40%/28.33% on the Mus musculus/Rattus norvegicus test sets in the area under the ROC curve (AUC), the area under the PR curve, F1-score, and MCC, respectively. It also performs well on a recently updated case of natural proteins with functional phosphorylation sites. Through an ablation study and visual analysis, we find that the design of the different feature modules contributes significantly to the classification accuracy of DeepMPSF, which provides valuable insights for predicting phosphorylation sites and offers effective support for future downstream research.
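
The abstract mentions an ensemble learning strategy for imbalanced data and reports F1-score and MCC, but gives no implementation details. The sketch below shows one common way such a strategy is realized, not necessarily the one used in DeepMPSF: the majority (non-phosphorylated) class is split into chunks about the size of the minority class, one model would be trained per balanced subset, and the member scores are averaged. The function names `balanced_subsets`, `ensemble_score`, and `f1_and_mcc` are hypothetical and chosen for illustration only.

```python
import math
import random


def balanced_subsets(positives, negatives, seed=0):
    """Pair all positives with successive chunks of shuffled negatives.

    Assumption: each chunk is about the size of the positive set, so every
    subset is roughly balanced (the last chunk may be smaller).
    """
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    size = max(1, len(positives))
    return [
        list(positives) + shuffled[i:i + size]
        for i in range(0, len(shuffled), size)
    ]


def ensemble_score(member_scores):
    """Average per-site scores over ensemble members.

    member_scores is a list with one score list per member, all of the
    same length (one score per candidate site).
    """
    n_members = len(member_scores)
    return [sum(site) / n_members for site in zip(*member_scores)]


def f1_and_mcc(y_true, y_pred):
    """Compute F1-score and Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return f1, mcc
```

In this sketch, a dataset with 3 positive and 7 negative sites yields three subsets, each containing all positives. MCC is useful alongside F1 on imbalanced data because it also accounts for true negatives.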

iPhosH-PseAAC: Identify Phosphohistidine Sites in Proteins by Blending Statistical Moments and Position Relative Features According to the Chou's 5-Step Rule and General Pseudo Amino Acid Composition

PROSPECT: A web server for predicting protein histidine phosphorylation sites

Psphos: PK-Specific Phosphorylation Site Prediction Using Profile SVM

Identification of protein-protein binding sites by incorporating the physicochemical properties and stationary wavelet transforms into pseudo amino acid composition

pHisPred: a tool for the identification of histidine phosphorylation sites by integrating amino acid patterns and properties

Phosphorylation Site Prediction Integrating The Position Feature With Sequence Evolution Information

PhosContext2vec: a distributed representation of residue-level sequence contexts and its application to general and kinase-specific phosphorylation site prediction

CaLMPhosKAN: Prediction of General Phosphorylation Sites in Proteins via Fusion of Codon Aware Embeddings with Amino Acid Aware Embeddings and Wavelet-based Kolmogorov Arnold Network

PredPhos: an Ensemble Framework for Structure-Based Prediction of Phosphorylation Sites

Prediction of Protein Phosphorylation Sites by Integrating Secondary Structure Information and Other One-Dimensional Structural Properties

Computational Identification of Protein Pupylation Sites by Using Profile-Based Composition of k-Spaced Amino Acid Pairs

Integrated Strategy for High-Confident Global Profiling of the Histidine Phosphoproteome

ScerePhoSite: An interpretable method for identifying fungal phosphorylation sites in proteins using sequence-based features

Computational Prediction and Analysis of Species-Specific Fungi Phosphorylation Via Feature Optimization Strategy

PhosAF: an Integrated Deep Learning Architecture for Predicting Protein Phosphorylation Sites with AlphaFold2 Predicted Structures

Improvement Of The Quantification Accuracy And Throughput For Phosphoproteome Analysis By A Pseudo Triplex Stable Isotope Dimethyl Labeling Approach

DeepMPSF: A Deep Learning Network for Predicting General Protein Phosphorylation Sites Based on Multiple Protein Sequence Features

Prediction of PK-specific phosphorylation site based on information entropy

PhosphoScan: A Probability-Based Method for Phosphorylation Site Prediction Using MS2/MS3 Pair Information

DeepPhoPred: Accurate Deep Learning Model to Predict Microbial Phosphorylation

General Phosphorylation Site Prediction Model Based on Attention Mechanism