Recurrent Neural Network-based Prediction of O-GlcNAcylation Sites in Mammalian Proteins

Pedro Seber, Richard D. Braatz
DOI: https://doi.org/10.1101/2023.08.24.554563
2024-01-25
Abstract: O-GlcNAcylation has the potential to be an important target for therapeutics, but no motif or algorithm that reliably predicts O-GlcNAcylation sites is available. Despite the importance of O-GlcNAcylation, current predictive models are insufficient because they fail to generalize, and many are no longer available. This article constructs MLP and RNN models to predict the presence of O-GlcNAcylation sites based on protein sequences. Multiple datasets are evaluated separately and assessed in terms of their strengths and issues. The models trained in this work achieve considerably better metrics than previously published models, with at least a two-fold increase in F score; the specific gains vary depending on the dataset. Within a given dataset, the results are robust to changes in cross-validation and test data, as determined by nested validation. The best model achieves an F score of 36% (more than 3.5-fold greater than the previous best model) and a Matthews Correlation Coefficient of 35% (more than 4.5-fold greater than the previous best model); its F score is also 7.6-fold higher than that obtained without using any model. Shapley values are used to interpret the model's predictions and provide biological insight into O-GlcNAcylation.
Bioinformatics
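The abstract reports results as an F score and a Matthews Correlation Coefficient (MCC). As a minimal sketch of how these two metrics are commonly computed from a binary confusion matrix, the snippet below uses the standard F1 and MCC definitions. The function names and all counts are hypothetical and chosen only for illustration; they are not taken from the paper. The "no model" baseline is assumed here to mean labeling every candidate site as positive, which is one plausible reading and not something the abstract states.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient; returns 0.0 when undefined."""
    num = tp * tn - fp * fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / denom if denom else 0.0


def all_positive_f1(n_pos: int, n_neg: int) -> float:
    """F1 of a trivial baseline that labels every site positive (assumption)."""
    return f1_score(tp=n_pos, fp=n_neg, fn=0)


if __name__ == "__main__":
    # Hypothetical, heavily imbalanced counts (NOT from the paper).
    tp, fp, fn, tn = 30, 50, 70, 3850
    model_f1 = f1_score(tp, fp, fn)
    baseline_f1 = all_positive_f1(n_pos=tp + fn, n_neg=fp + tn)
    print(f"F1 = {model_f1:.3f}, MCC = {mcc(tp, fp, fn, tn):.3f}")
    print(f"Fold gain over all-positive baseline = {model_f1 / baseline_f1:.2f}")
```

On imbalanced data such as this, where true sites are rare, both metrics stay low for a trivial classifier, which is why a fold improvement over a baseline can be more informative than the raw percentage.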