TemStaPro: protein thermostability prediction using sequence representations from protein language models

Ieva Pudžiuvelytė, Kliment Olechnovič, Egle Godliauskaite, Kristupas Sermokas, Tomas Urbaitis, Giedrius Gasiunas, Darius Kazlauskas
DOI: https://doi.org/10.1093/bioinformatics/btae157
IF: 5.8
2024-03-20
Bioinformatics
Abstract:
Motivation: Reliable prediction of protein thermostability from its sequence is valuable for both academic and industrial research. This prediction problem can be tackled using machine learning and by taking advantage of the recent blossoming of deep learning methods for sequence analysis. These methods can facilitate training on more data and, possibly, enable development of more versatile thermostability predictors for multiple ranges of temperatures.
Results: We applied the principle of transfer learning to predict protein thermostability using embeddings generated by protein language models (pLMs) from an input protein sequence. We used large pLMs that were pre-trained on hundreds of millions of known sequences. The embeddings from such models allowed us to efficiently train and validate a high-performing prediction method using over one million sequences that we collected from organisms with annotated growth temperatures. Our method, TemStaPro (Temperatures of Stability for Proteins), was used to predict thermostability of CRISPR-Cas Class II effector proteins (C2EPs). Predictions indicated sharp differences among groups of C2EPs in terms of thermostability and were largely in tune with previously published and our newly obtained experimental data.
Availability and Implementation: TemStaPro software and the related data are freely available from https://github.com/ievapudz/TemStaPro and https://doi.org/10.5281/zenodo.7743637.
biochemical research methods, biotechnology & applied microbiology, mathematical & computational biology
What problem does this paper attempt to address?
The paper addresses the problem of predicting protein thermostability directly from sequence. The proposed method, TemStaPro (Temperatures of Stability for Proteins), applies transfer learning: sequence embeddings generated by pre-trained protein language models (pLMs) serve as input features, which makes it practical to train predictive models on a large dataset. The main contributions are:
1. **Large-scale dataset construction**: The authors collected over 1 million protein sequences from organisms with annotated growth temperatures and used them to train, validate, and test multiple binary classifiers, each predicting against a different temperature threshold.
2. **Utilization of protein language models**: Input protein sequences are represented by embeddings generated by large pre-trained protein language models (such as ESM and ProtTrans ProtT5-XL).
3. **Comprehensive prediction method**: The TemStaPro software combines the binary classifiers to predict protein stability at several temperature thresholds and checks for inconsistencies among these predictions.
4. **Performance evaluation**: TemStaPro was applied to predict the thermostability of CRISPR-Cas Class II effector proteins (C2EPs) and was compared with other methods on existing benchmark datasets such as SAPPHIRE and iThermo.
In summary, the study introduces a thermostability prediction method that is based on protein language model embeddings and trained on more than one million sequences, providing a tool of use to both academic and industrial research.
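The idea of combining several threshold classifiers and checking them for inconsistencies can be illustrated with a minimal Python sketch. It is not taken from the TemStaPro code base: the function name `combine_threshold_predictions`, the threshold values in the example, and the 0.5 decision cutoff are all assumptions made for illustration. The sketch assumes each classifier returns a probability that the protein is stable at or above its threshold, and treats a set of calls as consistent when no "stable" call appears at a higher threshold after an "unstable" call at a lower one.

```python
from typing import Dict, Tuple


def combine_threshold_predictions(
    probs: Dict[int, float], cutoff: float = 0.5
) -> Tuple[str, bool]:
    """Combine per-threshold binary predictions into one label.

    probs maps a temperature threshold (hypothetical values, in degrees
    Celsius) to the predicted probability of stability at or above it.
    Returns (label, consistent).
    """
    thresholds = sorted(probs)
    # One binary call per threshold, ordered from lowest to highest.
    calls = [probs[t] >= cutoff for t in thresholds]

    # Consistent means the calls never switch from False back to True
    # as the threshold increases.
    consistent = all(
        earlier or not later for earlier, later in zip(calls, calls[1:])
    )

    # Highest threshold reached before the first negative call.
    highest = None
    for t, call in zip(thresholds, calls):
        if not call:
            break
        highest = t

    label = f"<{thresholds[0]}" if highest is None else f">={highest}"
    return label, consistent


if __name__ == "__main__":
    # Illustrative probabilities only; not real model output.
    example = {40: 0.9, 45: 0.8, 50: 0.7, 55: 0.3, 60: 0.2, 65: 0.1}
    print(combine_threshold_predictions(example))
```

Reporting the consistency flag separately from the label lets a user decide whether to trust a prediction in which the classifiers disagree, rather than silently resolving the conflict.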