Abstract: Machine learning is increasingly used in protein engineering, and predicting the effects of protein mutations has attracted growing attention. To date, the best results have come from methods based on protein language models, which are trained on large numbers of unlabeled protein sequences to capture the evolutionary rules hidden in those sequences and can therefore predict protein fitness from sequence alone. Although many such models and methods have been applied successfully in practical protein engineering, most studies have focused on building more complex language models that capture richer sequence features and on using those features for unsupervised fitness prediction. Considerable potential in the existing models remains untapped, for example whether integrating different models can further improve prediction accuracy. Furthermore, because the relationship between protein fitness and the quantitative measurement of a specific function is nonlinear, how to use large-scale models to predict mutational effects on quantifiable protein properties has yet to be explored thoroughly. In this study, we propose an ensemble learning approach for predicting the mutational effects of proteins that integrates sequence features extracted from multiple large protein language models with evolutionary coupling features extracted from homologous sequences, and we compare linear regression and deep learning models for mapping these features to quantifiable functional changes.
We tested our approach on a dataset of 17 protein deep mutational scans and showed that the integrated approach combined with linear regression gives the models higher prediction accuracy and better generalization. We further illustrated the reliability of the integrated approach by examining how predictive performance differs across species and protein sequence lengths, and by visualizing the clustering of ensemble and non-ensemble features.
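
The feature-integration idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the arrays `plm_a`, `plm_b`, and `coupling` are hypothetical stand-ins filled with random numbers rather than real language-model embeddings or coupling scores, the feature dimensions are arbitrary, and ridge-regularized linear regression is assumed as the linear model because the abstract says only "linear regression". The sketch concatenates per-variant feature blocks, fits the linear map on a training split, and scores held-out variants by Spearman rank correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_variants = 200

# Hypothetical per-variant feature blocks (random placeholders, not real data):
plm_a = rng.normal(size=(n_variants, 16))    # embedding from language model A
plm_b = rng.normal(size=(n_variants, 8))     # embedding from language model B
coupling = rng.normal(size=(n_variants, 4))  # evolutionary coupling features

# Ensemble features: concatenate all blocks along the feature axis.
X = np.hstack([plm_a, plm_b, coupling])

# Synthetic "measured" property that depends on every block, plus noise.
w_true = rng.normal(size=X.shape[1])
y = X @ w_true + 0.1 * rng.normal(size=n_variants)


def fit_ridge(X_train, y_train, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    A = Xc.T @ Xc + alpha * np.eye(X_train.shape[1])
    w = np.linalg.solve(A, Xc.T @ (y_train - y_mean))
    b = y_mean - x_mean @ w
    return w, b


def spearman(a, b):
    """Spearman correlation via ranks (assumes no ties, true for this data)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])


def evaluate(features, targets, n_train=150):
    """Fit on the first n_train variants, score the remaining ones."""
    w, b = fit_ridge(features[:n_train], targets[:n_train])
    pred = features[n_train:] @ w + b
    return spearman(pred, targets[n_train:])


rho_ensemble = evaluate(X, y)    # all feature blocks together
rho_single = evaluate(plm_a, y)  # one language model only
print(f"ensemble: {rho_ensemble:.3f}  single model: {rho_single:.3f}")
```

Because the synthetic target is built from all three blocks, the ensemble scores higher here by construction; the comparison only shows the mechanics of the evaluation and says nothing about results on real deep mutational scanning data.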

Protein language model rescue mutations highlight variant effects and structure in clinically relevant genes

Learning the language of proteins and predicting the impact of mutations

Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction

Language models enable zero-shot prediction of the effects of mutations on protein function

Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction

Learning the protein language: Evolution, structure, and function

Enhancing missense variant pathogenicity prediction with protein language models using VariPred

Genome-wide prediction of disease variant effects with a deep protein language model

Protein Language Model Predicts Mutation Pathogenicity and Clinical Prognosis

From a single sequence to evolutionary trajectories: protein language models capture the evolutionary potential of SARS-CoV-2 protein sequences

Enhancing predictions of protein stability changes induced by single mutations using MSA-based language models

Deciphering the Language of Nature: A transformer-based language model for deleterious mutations in proteins

Retrieval-Enhanced Mutation Mastery: Augmenting Zero-Shot Prediction of Protein Language Model

Cross-protein transfer learning substantially improves disease variant prediction

Fine-tuning the ESM2 protein language model to understand the functional impact of missense variants

InstructPLM: Aligning Protein Language Models to Follow Protein Structure Instructions

Understanding structure-guided variant effect predictions using 3D convolutional neural networks

Predicted mechanistic impacts of human protein missense variants

Structure-Informed Protein Language Model

Multi-level Protein Representation Learning for Blind Mutational Effect Prediction

VariPred: Enhancing Pathogenicity Prediction of Missense Variants Using Protein Language Models