PhosF3C: A Feature Fusion Architecture with Fine-Tuned Protein Language Model and Conformer for prediction of general phosphorylation site

Yuhuan Liu, Haitian Zhong, Jixiu Zhai, Xueying Wang, Tianchi LU
DOI: https://doi.org/10.1101/2024.12.25.630296
2024-12-25
Abstract: Protein phosphorylation, a key post-translational modification (PTM), provides essential insight into protein properties, which makes predicting phosphorylation sites an important task. Building on the capabilities of large language models (LLMs), we apply LoRA fine-tuning to ESM2, a protein language model, to extract features at low computational cost and to improve task-specific alignment. We also integrate the Conformer architecture with the Feature Coupling Unit (FCU) to enhance the exchange between local and global features, which further improves prediction accuracy. Our model achieves state-of-the-art (SOTA) performance, with AUC scores of 79.5%, 76.3%, and 71.4% for serine (S), threonine (T), and tyrosine (Y) sites, respectively, on the general datasets. Drawing on the feature extraction capabilities of LLMs, we conduct a series of analyses of the protein representations, covering structure, sequence, and several chemical properties (hydrophobicity (GRAVY), surface charge, and isoelectric point). We propose a test method called Linear Regression Tomography (LRT), a top-down method that uses the representations to probe the model's feature extraction capabilities and offers a pathway to improved interpretability.
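The LoRA idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python lists instead of ESM2 weights, and the function names (`lora_forward`, `merge_lora`), the layer sizes, the rank, and the scaling value `alpha` are all assumptions chosen for illustration. The sketch shows the standard LoRA formulation, in which a frozen weight matrix W is adapted by a low-rank product scaled by alpha / r, so that only the two small factors A and B are trained.

```python
import random


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


def lora_forward(x, w, a, b, alpha):
    """Compute x W^T + (alpha / r) * x A^T B^T.

    x: n x d_in inputs, w: d_out x d_in frozen weight,
    a: r x d_in and b: d_out x r trainable low-rank factors.
    """
    scale = alpha / len(a)
    base = matmul(x, transpose(w))
    update = matmul(matmul(x, transpose(a)), transpose(b))
    return [[u + scale * v for u, v in zip(br, ur)]
            for br, ur in zip(base, update)]


def merge_lora(w, a, b, alpha):
    """Fold the low-rank update into the weight: W + (alpha / r) * B A."""
    scale = alpha / len(a)
    delta = matmul(b, a)  # d_out x d_in
    return [[wv + scale * dv for wv, dv in zip(wr, dr)]
            for wr, dr in zip(w, delta)]


# Hypothetical toy sizes; real protein language model layers are far larger.
rng = random.Random(0)
d_in, d_out, rank, alpha = 6, 4, 2, 4.0
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]  # B starts at zero in LoRA
x = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(3)]

frozen_out = matmul(x, transpose(w))
adapted_out = lora_forward(x, w, a, b, alpha)
trainable = rank * (d_in + d_out)  # parameters in A and B
full = d_in * d_out                # parameters in W
```

Because B is initialized to zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to update `rank * (d_in + d_out)` values rather than `d_in * d_out`. The saving is small at these toy sizes and grows quickly as the layer dimensions increase relative to the rank.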
Biology