Sitetack: A Deep Learning Model that Improves PTM Prediction by Using Known PTMs

Clair S. Gutierrez, Alia A. Kassim, Benjamin D. Gutierrez, Ronald T. Raines
DOI: https://doi.org/10.1101/2024.06.03.596298
2024-06-04
Abstract: Post-translational modifications (PTMs) increase the diversity of the proteome and are vital to organismal life and therapeutic strategies. Deep learning has been used to predict PTM locations. Still, limitations in datasets and their analyses compromise success. Here we evaluate the use of known PTM sites in prediction via sequence-based deep learning algorithms. Specifically, PTM locations were encoded as a separate amino acid before sequences were encoded via word embedding and passed into a convolutional neural network that predicts the probability of a modification at a given site. Without labeling known PTMs, our model is on par with others. With labeling, however, we improved significantly upon extant models. Moreover, knowing PTM locations can increase the predictability of a different PTM. Our findings highlight the importance of PTMs for the installation of additional PTMs. We anticipate that including known PTM locations will enhance the performance of other proteomic machine learning algorithms.
Bioinformatics
What problem does this paper attempt to address?
This paper asks how known post-translational modification (PTM) sites can be used to improve deep-learning prediction of other PTM sites. The authors developed a deep-learning model named Sitetack, which encodes known PTM sites as separate amino acids and combines them with the sequence information to predict different types of PTM sites more accurately.

### Main problems and solutions

1. **Limitations of existing methods**:
   - Deep learning has already been used to predict PTM sites, but limitations in the datasets and their analysis still compromise prediction success.
   - Existing computational methods rarely evaluate systematically how known PTMs affect the prediction of PTM sites of the same type or of other types.
2. **Introduction of known PTM site information**:
   - The authors hypothesized that incorporating known PTM site information into model training could significantly improve prediction performance.
   - Known PTM sites are encoded as special amino acid symbols (such as "@" or "&"), and the labeled protein sequence is then passed into a convolutional neural network (CNN) for training.
3. **Verification and improvement**:
   - The authors tested this hypothesis in extensive experiments on datasets covering multiple PTM types (such as phosphorylation, N-glycosylation, and O-glycosylation).
   - In most cases, the model that includes known PTM site information performs significantly better than the model without it.

### Key findings

- **Performance improvement**: Across multiple PTM types, especially phosphorylation and hydroxylation, the model that includes known PTM site information shows a significant performance improvement. For example, for the human phosphorylation model, the AUC improves from 0.881 to 0.931.
- **Interactions between PTMs**: The study also found cross-influences between certain PTMs. For example, knowing the phosphorylation sites can improve the prediction accuracy of O-GlcNAc glycosylation.
- **Specific kinase models**: For phosphorylation prediction for specific kinases, the model that includes known phosphorylation site information also performs better.

### Summary

This paper shows that introducing known PTM site information can significantly improve the ability of deep-learning models to predict PTM sites. This not only helps to predict protein post-translational modifications more accurately, but also offers a new perspective on the interactions between PTMs. The authors also provide a free online tool ([Sitetack](https://sitetack.net)) so that researchers can run predictions with these improved models.
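The encoding idea summarized here (known PTM sites replaced by a special symbol before the sequence reaches the network) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the flank size, the "-" padding symbol, the integer vocabulary, and the choice to leave the candidate residue itself unlabeled are all assumptions made for illustration. Only the use of a symbol such as "@" for known sites comes from the summary.

```python
# Hypothetical sketch: names, window size, padding symbol, and vocabulary are
# assumptions for illustration, not taken from the Sitetack paper or code.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PTM_SYMBOL = "@"   # marks a residue with a known modification
PAD_SYMBOL = "-"   # fills window positions that fall beyond the sequence ends

# Integer vocabulary for an embedding layer: 0 is padding, 1-20 are the
# standard amino acids, and 21 is the extra "known PTM" token.
VOCAB = {PAD_SYMBOL: 0}
VOCAB.update({aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)})
VOCAB[PTM_SYMBOL] = len(VOCAB)


def label_known_ptms(sequence, known_sites):
    """Replace residues at known PTM positions (0-based) with PTM_SYMBOL."""
    chars = list(sequence)
    for pos in known_sites:
        chars[pos] = PTM_SYMBOL
    return "".join(chars)


def window_around(sequence, known_sites, center, flank=3):
    """Return the labeled window of length 2 * flank + 1 centered on `center`.

    Assumption: the candidate residue is left unlabeled so the input does not
    reveal the answer for the site being predicted.
    """
    labeled = label_known_ptms(sequence, set(known_sites) - {center})
    padded = PAD_SYMBOL * flank + labeled + PAD_SYMBOL * flank
    # After padding, original index `center` sits at `center + flank`,
    # so the window starts at `center` in the padded string.
    return padded[center : center + 2 * flank + 1]


def tokenize(window):
    """Map each character of a window to its integer token."""
    return [VOCAB[ch] for ch in window]
```

For example, with the toy sequence `"MKSPTSAYR"` and known sites at positions 2 and 5, `window_around("MKSPTSAYR", {2, 5}, center=4)` gives `"K@PT@AY"`. The resulting token list is what a word-embedding layer followed by a CNN would consume; running the same pipeline with an empty set of known sites gives the unlabeled baseline input.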