Leveraging Multimodal Protein Representations to Predict Protein Melting Temperatures

Daiheng Zhang, Yan Zeng, Xinyu Hong, Jinbo Xu
2024-12-06
Abstract: Accurately predicting protein melting temperature changes (∆Tm) is fundamental for assessing protein stability and guiding protein engineering. Leveraging multi-modal protein representations has shown great promise in capturing the complex relationships among protein sequences, structures, and functions. In this study, we develop models based on powerful protein language models, including ESM-2, ESM-3, SaProt, and AlphaFold, using various feature extraction methods to enhance prediction accuracy. By utilizing the ESM-3 model, we achieve a new state-of-the-art performance on the s571 test dataset, obtaining a Pearson correlation coefficient (PCC) of 0.50. Furthermore, we conduct a fair evaluation to compare the performance of different protein language models in the ∆Tm prediction task. Our results demonstrate that integrating multi-modal protein representations could advance the prediction of protein melting temperatures.
Machine Learning; Computational Engineering, Finance, and Science
What problem does this paper attempt to address?
This paper addresses the prediction of changes in protein melting temperature (∆Tm). Accurate ∆Tm prediction matters for assessing protein stability and guiding protein engineering. Compared with the prediction of thermodynamic stability changes (∆∆G), however, ∆Tm prediction has received less study, especially with deep-learning methods, partly because experimental data are scarce and the problem has drawn little attention. To address this, the authors propose a prediction framework, ESM3-DTm, which uses multimodal protein representations to improve prediction accuracy. By building on protein language models (such as ESM-2, ESM-3, SaProt, and AlphaFold) with different feature extraction methods, they report a new state-of-the-art result on the s571 test dataset: a Pearson correlation coefficient (PCC) of 0.50. They also compare the different protein language models on the ∆Tm prediction task under a fair evaluation, and the results indicate that multimodal protein representations improve prediction performance.

### Main contributions:

1. **A new framework**: The authors develop ESM3-DTm, a ∆Tm prediction framework based on multimodal protein representations.
2. **Performance improvement**: The framework achieves a PCC of 0.50 on the s571 test dataset, which the authors report as a new state of the art.
3. **Model comparison**: The authors compare different protein language models on the same task to assess the effectiveness of the multimodal model.

### Formula explanations:

- ∆Tm is the change in melting temperature.
- PCC (Pearson Correlation Coefficient) measures the linear correlation between the predicted values and the true values:
  \[ \text{PCC}=\frac{\sum_{i = 1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i = 1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i = 1}^{n}(y_i-\bar{y})^2}} \]
  where \(x_i\) and \(y_i\) are the predicted value and the true value for sample \(i\), \(n\) is the number of samples, and \(\bar{x}\) and \(\bar{y}\) are the respective means.
- MAE (Mean Absolute Error) is the average absolute difference between the predicted values and the true values:
  \[ \text{MAE}=\frac{1}{n}\sum_{i = 1}^{n}|x_i - y_i| \]
- RMSE (Root Mean Square Error) is the square root of the mean squared difference between the predicted values and the true values:
  \[ \text{RMSE}=\sqrt{\frac{1}{n}\sum_{i = 1}^{n}(x_i - y_i)^2} \]

The authors use these three metrics to evaluate the model's prediction performance and to support their claim that multimodal protein representations help in predicting melting temperature changes.
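The three metrics defined above can be computed directly from paired lists of predicted and measured ∆Tm values. The following is a minimal sketch in plain Python, not the authors' evaluation code; the function names `pcc`, `mae`, and `rmse` and the sample values are illustrative assumptions and do not come from the paper or the s571 dataset.

```python
import math


def pcc(pred, true):
    """Pearson correlation coefficient between predictions and true values."""
    n = len(pred)
    mean_x = sum(pred) / n
    mean_y = sum(true) / n
    # Numerator: sum of products of deviations from the means.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(pred, true))
    # Denominator: product of the root sums of squared deviations.
    den_x = math.sqrt(sum((x - mean_x) ** 2 for x in pred))
    den_y = math.sqrt(sum((y - mean_y) ** 2 for y in true))
    return num / (den_x * den_y)


def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(x - y) for x, y in zip(pred, true)) / len(pred)


def rmse(pred, true):
    """Root mean square error."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pred, true)) / len(pred))


if __name__ == "__main__":
    # Made-up ∆Tm values (in °C) purely for illustration.
    predicted = [1.0, -2.0, 0.5, 3.0]
    measured = [1.5, -1.0, 0.0, 2.0]
    print("PCC :", pcc(predicted, measured))
    print("MAE :", mae(predicted, measured))
    print("RMSE:", rmse(predicted, measured))
```

Note that `pcc` is undefined when either list is constant, because the denominator is then zero; a production implementation would need to handle that case explicitly.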