Abstract: Post-translational modifications (PTMs) profoundly expand the complexity and functionality of the proteome, regulating protein attributes and interactions that are crucial for biological processes. Accurately predicting PTM sites and their specific types is therefore essential for elucidating protein function and understanding disease mechanisms. Existing computational approaches predominantly focus on protein sequences to predict PTM sites, driven by the recognition of sequence-dependent motifs. However, these approaches often overlook protein structural contexts. In this work, we first compile a large-scale sequence-structure PTM dataset, which serves as the foundation for fair comparison. We introduce the MeToken model, which tokenizes the micro-environment of each amino acid, integrating both sequence and structural information into unified discrete tokens. This model not only captures the typical sequence motifs associated with PTMs but also leverages the spatial arrangements dictated by protein tertiary structures, thus providing a holistic view of the factors influencing PTM sites. Designed to address the long-tail distribution of PTM types, MeToken employs uniform sub-codebooks that ensure even the rarest PTMs are adequately represented and distinguished. We validate the effectiveness and generalizability of MeToken across multiple datasets, demonstrating its superior performance in accurately identifying PTM types. The results underscore the importance of incorporating structural data and highlight MeToken's potential in facilitating accurate and comprehensive PTM predictions, which could significantly impact proteomics research. The code and datasets are available at https://github.com/A4Bio/MeToken.

What problem does this paper attempt to address?

The paper addresses the accurate prediction of protein post-translational modification (PTM) sites and their specific types. PTMs regulate biological processes by changing protein properties and interactions, so predicting PTM sites accurately is essential for understanding protein function and disease mechanisms. Existing computational methods, however, rely mainly on protein sequences and ignore protein structure, which limits their predictive performance. Specifically, the paper targets three problems:

1. **Limitations of existing methods**: Most existing computational methods focus on sequence-dependent motifs and ignore protein tertiary structure. They often fail to capture complex context-dependent relationships, especially for modifications such as phosphorylation, where sequence motifs are not consistent or stable enough.
2. **Lack of large-scale sequence-structure PTM datasets**: Although joint sequence-structure modeling has been explored for protein function prediction, PTM prediction still lacks a comprehensively annotated large-scale dataset that contains both sequence and structure information.
3. **Long-tail distribution of PTM types**: PTM types are distributed very unevenly, with some types occurring far more often than others. This imbalance complicates model training and hurts the model's ability to generalize.

To solve these problems, the authors first construct a large-scale sequence-structure PTM dataset and introduce a new model named MeToken. MeToken encodes the micro-environment of each amino acid into unified discrete tokens that integrate sequence and structure information, giving a more comprehensive basis for predicting PTM sites. In addition, MeToken adopts a uniform sub-codebook strategy to handle the long-tail distribution of PTM types, so that even rare PTM types are adequately represented and distinguished.

### Summary of main contributions:

- Constructed a large-scale sequence-structure PTM dataset, providing a basis for PTM prediction from sequence-structure pairs.
- Introduced the concept of a micro-environment token, which represents sequence-level and structure-level context within a single discrete token set.
- Proposed a uniform sub-codebook strategy, which addresses the long-tail distribution of PTM data so that all PTM types are adequately represented and distinguished.

Together, these contributions improve the accuracy of PTM prediction and provide strong support for proteomics research.
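To make the idea of discrete micro-environment tokens and uniform sub-codebooks more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that each PTM type owns an equally sized block of code vectors and that an embedding is tokenized by nearest-neighbour lookup (plain vector quantization). The function names, the random code vectors, and the sizes are all hypothetical stand-ins chosen for illustration.

```python
import math
import random


def build_uniform_codebook(num_types, codes_per_type, dim, seed=0):
    """Build a codebook in which every PTM type gets the same number of codes.

    Assumption: codes are random vectors here; a real model would learn them.
    Returns a list of (ptm_type, vector) pairs, grouped by type.
    """
    rng = random.Random(seed)
    codebook = []
    for ptm_type in range(num_types):
        for _ in range(codes_per_type):
            codebook.append((ptm_type, [rng.gauss(0.0, 1.0) for _ in range(dim)]))
    return codebook


def quantize(embedding, codebook):
    """Return the index of the code vector closest to the embedding."""
    best_index, best_dist = -1, math.inf
    for index, (_, code) in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(embedding, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def token_to_type(token, codes_per_type):
    """Recover which sub-codebook (PTM type) a token index belongs to."""
    return token // codes_per_type


# Hypothetical sizes: 4 PTM types, 3 codes each, 8-dimensional embeddings.
codebook = build_uniform_codebook(num_types=4, codes_per_type=3, dim=8)

# A stand-in "micro-environment embedding": a slightly perturbed copy of code 7.
noisy = [value + 0.001 for value in codebook[7][1]]
token = quantize(noisy, codebook)
ptm_type = token_to_type(token, codes_per_type=3)
```

Because every type receives the same number of codes regardless of how many training examples it has, a rare type is not crowded out of the token vocabulary by frequent ones, which is the intuition behind the uniform sub-codebook strategy described above.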

MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction
