MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction

Cheng Tan, Zhenxiao Cao, Zhangyang Gao, Lirong Wu, Siyuan Li, Yufei Huang, Jun Xia, Bozhen Hu, Stan Z. Li
2024-11-04
Abstract: Post-translational modifications (PTMs) profoundly expand the complexity and functionality of the proteome, regulating protein attributes and interactions that are crucial for biological processes. Accurately predicting PTM sites and their specific types is therefore essential for elucidating protein function and understanding disease mechanisms. Existing computational approaches predominantly focus on protein sequences to predict PTM sites, driven by the recognition of sequence-dependent motifs. However, these approaches often overlook protein structural contexts. In this work, we first compile a large-scale sequence-structure PTM dataset, which serves as the foundation for fair comparison. We introduce the MeToken model, which tokenizes the micro-environment of each amino acid, integrating both sequence and structural information into unified discrete tokens. This model not only captures the typical sequence motifs associated with PTMs but also leverages the spatial arrangements dictated by protein tertiary structures, thus providing a holistic view of the factors influencing PTM sites. Designed to address the long-tail distribution of PTM types, MeToken employs uniform sub-codebooks that ensure even the rarest PTMs are adequately represented and distinguished. We validate the effectiveness and generalizability of MeToken across multiple datasets, demonstrating its superior performance in accurately identifying PTM types. The results underscore the importance of incorporating structural data and highlight MeToken's potential in facilitating accurate and comprehensive PTM predictions, which could significantly impact proteomics research. The code and datasets are available at https://github.com/A4Bio/MeToken.
Machine Learning, Biomolecules
What problem does this paper attempt to address?
This paper addresses the accurate prediction of protein post-translational modification (PTM) sites and their specific types. PTMs regulate biological processes by changing protein properties and interactions, so predicting PTM sites accurately is essential for understanding protein function and disease mechanisms. However, existing computational methods rely mainly on protein sequences and overlook protein structure, which limits their predictive performance. Specifically, the paper targets three problems:

1. **Limitations of existing methods**: Most existing computational methods focus on sequence-dependent motifs and ignore protein tertiary structure. They often fail to capture complex context-dependent relationships, especially for modifications such as phosphorylation, where sequence motifs are not consistent or stable enough.
2. **Lack of large-scale sequence-structure PTM datasets**: Joint sequence-structure modeling has been explored for protein function prediction, but PTM prediction still lacks a comprehensively annotated, large-scale dataset that contains both sequence and structure information.
3. **Long-tail distribution of PTM types**: PTM types are distributed very unevenly, with some modifications occurring far more often than others. This imbalance complicates model training and hurts the model's ability to generalize.

To solve these problems, the authors first construct a large-scale sequence-structure PTM dataset and then introduce a new model named MeToken. MeToken encodes the micro-environment of each amino acid into unified discrete tokens that integrate sequence and structure information, giving a more comprehensive view for predicting PTM sites. In addition, MeToken adopts a uniform sub-codebook strategy to handle the long-tail distribution of PTM types, so that even rare PTM types are adequately represented and distinguished.

### Summary of main contributions:

- Constructed a large-scale sequence-structure PTM dataset, providing a basis for PTM prediction from sequence-structure pairs.
- Introduced the concept of the micro-environment token, which represents sequence-level and structure-level context within a single discrete token set.
- Proposed a uniform sub-codebook strategy, which addresses the long-tail distribution in PTM data so that all PTM types are adequately represented and distinguished.

Together, these contributions improve the accuracy of PTM prediction and support proteomics research.
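To make the uniform sub-codebook idea more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each PTM type owns one sub-codebook, that all sub-codebooks have the same number of code vectors regardless of how frequent the type is, and that an embedding is assigned to its nearest code by squared Euclidean distance, as in standard vector quantization. The function names, the random initialization, and the PTM type labels are hypothetical; the summary above does not describe how MeToken actually learns or assigns its codes.

```python
import random


def build_uniform_codebook(ptm_types, codes_per_type, dim, seed=0):
    """Create one sub-codebook per PTM type, all of the same size.

    Assumption: codes are randomly initialized here purely for illustration;
    a real model would learn them during training.
    """
    rng = random.Random(seed)
    return {
        ptm_type: [
            [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(codes_per_type)
        ]
        for ptm_type in ptm_types
    }


def quantize(embedding, codebook):
    """Map an embedding to its nearest code across all sub-codebooks.

    Returns (ptm_type, code_index, token_id), where token_id is the index
    of the code in the concatenation of all sub-codebooks.
    """
    best = None
    for type_idx, (ptm_type, codes) in enumerate(codebook.items()):
        for code_idx, code in enumerate(codes):
            # Squared Euclidean distance between embedding and code vector.
            dist = sum((a - b) ** 2 for a, b in zip(embedding, code))
            if best is None or dist < best[0]:
                token_id = type_idx * len(codes) + code_idx
                best = (dist, ptm_type, code_idx, token_id)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    # Hypothetical PTM type labels; a rare type gets as many codes as a common one.
    types = ["phosphorylation", "acetylation", "rare_ptm"]
    codebook = build_uniform_codebook(types, codes_per_type=4, dim=8)
    query = codebook["rare_ptm"][2]  # an embedding that coincides with a code
    print(quantize(query, codebook))
```

Because every type receives the same number of codes, a rare PTM type is not crowded out of the token space by frequent ones, which is the intuition the summary attributes to the uniform sub-codebook strategy.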