Abstract: Machine learning is increasingly used in protein engineering, and predicting the effects of protein mutations has attracted growing attention. The best results so far have come from methods based on protein language models, which are trained on large numbers of unlabeled protein sequences to capture the evolutionary rules hidden in those sequences and can therefore predict fitness from sequence alone. Although many such models and methods have been applied successfully in practical protein engineering, most studies have focused on building more complex language models that capture richer sequence features and on using those features for unsupervised fitness prediction. Considerable potential in the existing models remains untapped; for example, it is unclear whether integrating different models can further improve prediction accuracy. Furthermore, because the relationship between protein fitness and quantitative measures of specific functions is nonlinear, how to use large-scale models to predict mutational effects on quantifiable protein properties has yet to be explored thoroughly. In this study, we propose an ensemble learning approach for predicting the mutational effects of proteins that integrates sequence features extracted from multiple large protein language models with evolutionary coupling features extracted from homologous sequences, and we compare linear regression and deep learning models for mapping these features to quantifiable functional changes.
We tested our approach on a dataset of 17 protein deep mutational scans and showed that the integrated approach combined with linear regression gives the models higher prediction accuracy and better generalization. We further illustrated the reliability of the integrated approach by examining how predictive performance differs across species and protein sequence lengths, and by visualizing the clustering of ensemble and non-ensemble features.
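
The feature-integration idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each protein language model and the evolutionary coupling analysis yield a fixed-length feature vector per variant, stands in for those vectors with synthetic random numbers, and uses ridge regression fitted by gradient descent as one possible form of the linear regression the abstract mentions. All names (`plm_a`, `plm_b`, `coupling`, `fit_ridge`, `evaluate`) are hypothetical.

```python
import random
import statistics

def standardize(train, test):
    """Z-score every column using statistics from the training rows only."""
    cols = list(zip(*train))
    mu = [statistics.fmean(c) for c in cols]
    sd = [statistics.pstdev(c) or 1.0 for c in cols]
    scale = lambda rows: [[(v - m) / s for v, m, s in zip(r, mu, sd)] for r in rows]
    return scale(train), scale(test)

def build(blocks, n_train):
    """Standardize each feature block separately, then concatenate per variant."""
    train_parts, test_parts = [], []
    for block in blocks:
        tr, te = standardize(block[:n_train], block[n_train:])
        train_parts.append(tr)
        test_parts.append(te)
    join = lambda parts: [sum(rows, []) for rows in zip(*parts)]
    return join(train_parts), join(test_parts)

def fit_ridge(X, y, alpha=1.0, lr=0.1, epochs=500):
    """Fit L2-regularized linear regression with plain gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [sum(wi * xi for wi, xi in zip(w, row)) + b - t for row, t in zip(X, y)]
        grad_w = [(sum(e * row[j] for e, row in zip(errs, X)) + alpha * w[j]) / n
                  for j in range(d)]
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * sum(errs) / n
    return w, b

def predict(X, w, b):
    return [sum(wi * xi for wi, xi in zip(w, row)) + b for row in X]

def spearman(a, b):
    """Spearman rank correlation (assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(a), ranks(b)
    ma, mb = statistics.fmean(ra), statistics.fmean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Synthetic stand-ins (assumption): real inputs would be per-variant embeddings
# from several language models plus coupling scores from homologous sequences.
random.seed(0)
n, n_train = 200, 150

def random_block(dim):
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

plm_a, plm_b, coupling = random_block(3), random_block(3), random_block(2)
# Toy "measured property": depends on one feature from each source, plus noise.
y = [a[0] + b[0] + c[0] + random.gauss(0, 0.1)
     for a, b, c in zip(plm_a, plm_b, coupling)]

def evaluate(blocks):
    X_train, X_test = build(blocks, n_train)
    w, b = fit_ridge(X_train, y[:n_train])
    return spearman(predict(X_test, w, b), y[n_train:])

rho_ensemble = evaluate([plm_a, plm_b, coupling])
rho_single = evaluate([plm_a])
print(f"ensemble Spearman: {rho_ensemble:.3f}, single-source Spearman: {rho_single:.3f}")
```

In this toy setup the target is built to depend on all three feature sources, so the concatenated features are expected to rank held-out variants better than any single source. Whether the same holds on real deep mutational scanning data is the empirical question the study addresses.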

Robust Prediction of Mutation-Induced Protein Stability Change by Property Encoding of Amino Acids.

Structure-based Prediction of the Effects of a Missense Variant on Protein Stability.

Combining Network Topological Characteristics With Sequence And Structure Based Features For Predicting Protein Stability Changes Upon Single Amino Acid Mutation

Assessing the Performance of Computational Predictors for Estimating Protein Stability Changes Upon Missense Mutations

Physicochemical feature-based classification of amino acid mutations.

Assessing computational tools for predicting protein stability changes upon missense mutations using a new dataset

Three Simple Properties Explain Protein Stability Change upon Mutation

Predicting protein thermal stability changes upon single and multi-point mutations via restricted attention subgraph neural network

BayeStab: Predicting Effects of Mutations on Protein Stability with Uncertainty Quantification

Comparing Supervised Learning and Rigorous Approach for Predicting Protein Stability upon Point Mutations in Difficult Targets

Convolution Neural Network-Based Prediction of Protein Thermostability.

Prediction of mutation-induced protein stability changes based on the geometric representations learned by a self-supervised method

PROST: AlphaFold2-aware Sequence-Based Predictor to Estimate Protein Stability Changes upon Missense Mutations

An Efficient Method to Predict Protein Thermostability in Alanine Mutation

ProS-GNN: Predicting Effects of Mutations on Protein Stability Using Graph Neural Networks

Predicting a Protein's Stability under a Million Mutations

A three-state prediction of single point mutations on protein stability changes

Predicting Protein Thermostability Upon Mutation Using Molecular Dynamics Timeseries Data

Protein stability models fail to capture epistatic interactions of double point mutations

Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction

STRUM: structure-based prediction of protein stability changes upon single-point mutation