Predicting O-GlcNAcylation Sites in Mammalian Proteins with Transformers and RNNs Trained with a New Loss Function

Pedro Seber
2024-08-27
Abstract: Glycosylation, a protein modification, has multiple essential functional and structural roles. O-GlcNAcylation, a subtype of glycosylation, has the potential to be an important target for therapeutics, but methods to reliably predict O-GlcNAcylation sites had not been available until 2023; a 2021 review correctly noted that published models were insufficient and failed to generalize. Moreover, many are no longer usable. In 2023, a considerably better RNN model with an F$_1$ score of 36.17% and an MCC of 34.57% on a large dataset was published. This article first sought to improve these metrics using transformer encoders. While transformers displayed high performance on this dataset, their performance was inferior to that of the previously published RNN. We then created a new loss function, which we call the weighted focal differentiable MCC, to improve the performance of classification models. RNN models trained with this new function display superior performance to models trained using the weighted cross-entropy loss; this new function can also be used to fine-tune trained models. A two-cell RNN trained with this loss achieves state-of-the-art performance in O-GlcNAcylation site prediction with an F$_1$ score of 38.88% and an MCC of 38.20% on that large dataset.
Machine Learning, Molecular Networks
What problem does this paper attempt to address?
The paper addresses the problem of predicting O-GlcNAcylation sites in mammalian proteins. O-GlcNAcylation is a type of glycosylation that is crucial for the function and structure of proteins and is associated with diseases such as cancer, infection, and heart failure. Reliable prediction methods did not emerge until 2023: a 2021 review found that the models published up to that point performed inadequately and failed to generalize, and many of them are no longer usable. The baseline is an RNN published in 2023 with an F1 score of 36.17% and an MCC of 34.57% on a large dataset. The authors first tried to improve on these metrics with transformer encoders, but the transformers performed worse than the previously published RNN. They then developed a new loss function, the weighted focal differentiable Matthews correlation coefficient (MCC), to improve the performance of classification models. RNN models trained with this loss outperformed models trained with the weighted cross-entropy loss, and the loss can also be used to fine-tune already trained models. Ultimately, a two-cell RNN trained with this loss achieved state-of-the-art performance on the same large dataset, with an F1 score of 38.88% and an MCC of 38.20%.
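To make the idea of a differentiable MCC concrete, the sketch below replaces the hard confusion-matrix counts with sums of predicted probabilities, so that the resulting expression is smooth in the model outputs. It is only an illustration of the general technique, not the authors' loss: the exact weighting and focal terms are not given in this summary, so the focal term is omitted, and the function name `soft_mcc_loss` and the parameters `pos_weight` and `eps` are hypothetical.

```python
import math

def soft_mcc_loss(probs, labels, pos_weight=1.0, eps=1e-8):
    """Return 1 - soft MCC for binary labels (0/1) and predicted probabilities.

    Assumption: pos_weight is a hypothetical class weight applied to
    positive examples; it is not taken from the paper.
    """
    tp = fp = tn = fn = 0.0
    for p, y in zip(probs, labels):
        w = pos_weight if y == 1 else 1.0
        # Soft counts: each example contributes its probability mass
        # instead of a hard 0/1 decision.
        tp += w * p * y
        fn += w * (1.0 - p) * y
        fp += w * p * (1 - y)
        tn += w * (1.0 - p) * (1 - y)
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # eps guards against division by zero when a row or column sum is empty.
    return 1.0 - numerator / (denominator + eps)

labels = [1, 0, 0, 1, 0]
perfect = soft_mcc_loss([1.0, 0.0, 0.0, 1.0, 0.0], labels)
uninformative = soft_mcc_loss([0.5] * 5, labels)
inverted = soft_mcc_loss([0.0, 1.0, 1.0, 0.0, 1.0], labels)
```

Under these assumptions the loss is close to 0 for perfect predictions, 1 for uninformative ones, and close to 2 for fully inverted ones. In an actual training setup the same arithmetic would be written with tensor operations in an autodiff framework so that gradients flow back to the model.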