Enhancing predictions of protein stability changes induced by single mutations using MSA-based Language Models

Francesca Cuturello, Marco Celoria, Alessio Ansuini, Alberto Cazzaniga

DOI: https://doi.org/10.1101/2024.04.11.589002

2024-07-09

Abstract: Protein Language Models offer a new perspective for addressing challenges in structural biology, while relying solely on sequence information. Recent studies have investigated their effectiveness in forecasting shifts in thermodynamic stability caused by single amino acid mutations, a task known for its complexity due to the sparse availability of data, constrained by experimental limitations. To tackle the problem, we fine-tune various pre-trained models using a recently released mega-scale dataset. Our approach employs a stringent policy to reduce the widespread issue of overfitting, by removing sequences from the training set when they exhibit significant similarity with the test set. The MSA Transformer emerges as the most accurate among the models under investigation, given its capability to leverage co-evolution signals encoded in aligned homologous sequences. Moreover, the optimized MSA Transformer outperforms existing methods and exhibits enhanced generalization power, leading to a notable improvement in predicting changes in protein stability resulting from point mutations. The code and data associated with this study are available at https://github.com/marco-celoria/PLM4Muts.

Bioinformatics

What problem does this paper attempt to address?

The paper aims to address the problem of predicting stability changes caused by single-point mutations in proteins. Specifically:

- **Research Background**: Protein language models (PLMs) offer a new perspective on problems in structural biology while relying solely on sequence information. Recent studies have examined how well these models predict the thermodynamic stability changes caused by single amino acid mutations. The task is complicated by the scarcity of experimental data.
- **Research Methods**: The paper introduces two key innovations:
  1. Using protein language models that take multiple sequence alignments (MSAs) as input, so that evolutionary information is captured.
  2. Training on a large-scale dataset and applying rigorous preprocessing to reduce overfitting.
- **Data Processing**: To prevent data leakage between the training and test sets, the researchers used BLASTp for local pairwise alignment and removed training sequences whose similarity to test sequences exceeded specific thresholds.
- **Model Optimization**: The authors fine-tuned several pre-trained models and ran ablation studies and baseline evaluations. Among the models tested, the MSA Transformer was the most accurate at predicting stability changes caused by single-point mutations.
- **Experimental Results**: The optimized MSA Transformer outperformed existing methods across multiple datasets, notably on the S669 and Ssym test sets, which indicates strong generalization.

In summary, the paper aims to improve the accuracy of predicting how single-point mutations affect protein stability by combining protein language models with multiple sequence alignments, with potential applications in fields such as protein engineering and drug development.
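The leakage-filtering step described under Data Processing can be illustrated with a minimal sketch. It assumes BLASTp was run with training sequences as queries against the test set and that the hits are available in BLAST's standard 12-column tabular format (`-outfmt 6`). The function names, the toy hit rows, and the threshold values are hypothetical and chosen only for illustration; the summary does not state the thresholds the authors used, and this is not the authors' implementation.

```python
from typing import Iterable, List, Set


def leaking_train_ids(
    blast_rows: Iterable[str], min_identity: float, max_evalue: float
) -> Set[str]:
    """Collect training IDs with a significant hit against any test sequence.

    Each row is one line of BLAST tabular output (-outfmt 6):
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    Queries are assumed to be training sequences, subjects test sequences.
    """
    flagged: Set[str] = set()
    for row in blast_rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        train_id = fields[0]
        identity = float(fields[2])   # percent identity of the local alignment
        evalue = float(fields[10])    # expectation value of the hit
        if identity >= min_identity and evalue <= max_evalue:
            flagged.add(train_id)
    return flagged


def filter_training_set(train_ids: Iterable[str], flagged: Set[str]) -> List[str]:
    """Keep only training IDs that were not flagged as similar to the test set."""
    return [t for t in train_ids if t not in flagged]


# Toy hits (hypothetical); thresholds below are illustrative, not the paper's.
demo_rows = [
    "trainA\ttest1\t95.0\t120\t6\t0\t1\t120\t1\t120\t1e-60\t230",
    "trainB\ttest2\t22.5\t40\t31\t0\t5\t44\t9\t48\t3.2\t18.1",
    "trainC\ttest1\t41.0\t110\t60\t2\t1\t110\t3\t112\t1e-12\t65.0",
]
flagged = leaking_train_ids(demo_rows, min_identity=30.0, max_evalue=1e-3)
kept = filter_training_set(["trainA", "trainB", "trainC", "trainD"], flagged)
print(kept)
```

In this toy run, `trainA` and `trainC` are dropped because each has a high-identity, low-E-value hit against a test sequence, while `trainB` (weak hit) and `trainD` (no hit) are kept. Filtering on the training side leaves the test set untouched, so results stay comparable with other methods evaluated on the same benchmark.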

Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models

Protein language models trained on multiple sequence alignments learn phylogenetic relationships

Exploring evolution to uncover insights into protein mutational stability

Protein stability prediction by fine-tuning a protein language model on a mega-scale dataset

Generative power of a protein language model trained on multiple sequence alignments

Efficient Inference, Training, and Fine-tuning of Protein Language Models

learnMSA2: deep protein multiple alignments with large language and hidden Markov models

MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering

Protein language model rescue mutations highlight variant effects and structure in clinically relevant genes

Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation

Retrieval Augmented Protein Language Models for Protein Structure Prediction

THPLM: a sequence-based deep learning framework for protein stability changes prediction upon point variations using pretrained protein language model

ProBASS: a language model with sequence and structural features for predicting the effect of mutations on binding affinity

So ManyFolds, So Little Time: Efficient Protein Structure Prediction With pLMs and MSAs

InstructPLM: Aligning Protein Language Models to Follow Protein Structure Instructions

Exploring evolution-aware & -free protein language models as protein function predictors

Learning the language of proteins and predicting the impact of mutations

Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction

Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction