Abstract:

Motivation: O-linked glycosylation, an essential post-translational modification in Homo sapiens, attaches sugar moieties to the oxygen atoms of serine and/or threonine residues and influences a wide range of biological and cellular functions. Although every serine or threonine residue in a protein sequence is a potential O-linked glycosylation site, not all of them undergo this modification, which makes it important to characterize where it occurs. This study presents a novel approach for predicting intracellular and extracellular O-linked glycosylation events on proteins, which are crucial for understanding cellular processes. The approach uses a stacked generalization framework: two base multi-layer perceptron models are trained on O-linked glycosylation site-specific embeddings from ProtT5 and Ankh, respectively, and their combined predictions are used to train a meta multi-layer perceptron model. Trained on extensive O-linked glycosylation datasets, the stacked-generalization model demonstrated high predictive performance on independent test datasets. Furthermore, the study emphasizes the distinction between nucleocytoplasmic and extracellular O-linked glycosylation, offering insights into their functional implications that were overlooked in previous studies. By integrating protein language model embeddings with stacked generalization, this approach improves the predictive accuracy for O-linked glycosylation events and illuminates the intricate roles of O-linked glycosylation in proteomics, potentially accelerating the discovery of novel glycosylation sites.

Results: On the benchmark NetOGlyc-4.0 independent test dataset, Stack-OglyPred-PLM achieves a sensitivity, specificity, Matthews correlation coefficient, and accuracy of 90.50%, 89.60%, 0.464, and 89.70%, respectively. These results demonstrate that Stack-OglyPred-PLM is a robust computational tool for predicting O-linked glycosylation sites in proteins.

Availability and implementation: The developed tool, programs, and training and test datasets are available at https://github.com/PakhrinLab/Stack-OglyPred-PLM.
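
The stacked generalization scheme described in the abstract (two base multi-layer perceptrons, one per embedding type, feeding a meta multi-layer perceptron) can be illustrated with a minimal sketch. This is not the released Stack-OglyPred-PLM code. It assumes scikit-learn's `MLPClassifier` as a stand-in for the authors' networks, uses small random arrays in place of real ProtT5 and Ankh site embeddings, and trains the meta-model on out-of-fold base predictions, which is a common stacking practice that the abstract does not confirm. All array sizes, layer widths, and helper names (`make_mlp`, `fit_base`) are hypothetical. The sketch also computes the four metrics reported in the Results from a confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-ins for per-site embeddings. Real ProtT5/Ankh
# vectors would come from the language models; dimensions here are arbitrary.
n_sites = 400
y = rng.integers(0, 2, size=n_sites)          # 1 = glycosylated S/T site
X_prott5 = rng.normal(size=(n_sites, 32)) + 0.8 * y[:, None]
X_ankh = rng.normal(size=(n_sites, 24)) + 0.8 * y[:, None]

# Hold out an independent test split, stratified by label.
idx_train, idx_test = train_test_split(
    np.arange(n_sites), test_size=0.25, stratify=y, random_state=0
)


def make_mlp():
    # Hypothetical architecture; the paper's layer sizes are not given here.
    return MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)


def fit_base(X):
    """Return a fitted base MLP and its out-of-fold training probabilities."""
    oof = cross_val_predict(
        make_mlp(), X[idx_train], y[idx_train], cv=5, method="predict_proba"
    )[:, 1]
    model = make_mlp().fit(X[idx_train], y[idx_train])
    return model, oof


base_t5, oof_t5 = fit_base(X_prott5)
base_ankh, oof_ankh = fit_base(X_ankh)

# The meta-MLP sees only the two base-model probabilities per site.
meta_train = np.column_stack([oof_t5, oof_ankh])
meta = make_mlp().fit(meta_train, y[idx_train])

# At test time, base predictions are stacked the same way and passed on.
meta_test = np.column_stack([
    base_t5.predict_proba(X_prott5[idx_test])[:, 1],
    base_ankh.predict_proba(X_ankh[idx_test])[:, 1],
])
y_pred = meta.predict(meta_test)

# Metrics named in the Results section, derived from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y[idx_test], y_pred, labels=[0, 1]).ravel()
metrics = {
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "mcc": matthews_corrcoef(y[idx_test], y_pred),
    "accuracy": (tp + tn) / (tp + tn + fp + fn),
}
print(metrics)
```

Because the data here are synthetic and deliberately easy to separate, the printed values say nothing about the real task. The point of the sketch is the data flow: each base model is fitted on one embedding type, and only their predicted probabilities reach the meta-model.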
