Abstract: The emergence of SARS-CoV-2 variants underscored the need for tools that can interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) data sets continue to expand our understanding of the mutational landscape of single proteins, their results remain challenging to analyze. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation from single sequences almost as accurately as ConSeq did using MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach to be competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others consistently or with statistical significance, regardless of the performance measure applied (Spearman or Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the computing and energy cost. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
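
The logistic regression ensemble described above can be illustrated with a minimal sketch. It is not the VESPA implementation: the function name, the reduction of each input to a single number, the sign convention of the log-odds feature, and all weights are assumptions chosen only to show how three per-variant features could be combined into one score between 0 and 1.

```python
import math


def sav_effect_score(conservation, blosum62, mask_log_odds, weights, bias):
    """Combine three per-variant features with a logistic function.

    conservation  : predicted conservation of the residue, scaled to [0, 1]
    blosum62      : BLOSUM62 substitution score for wild type -> mutant
    mask_log_odds : log(p_mutant / p_wildtype) from pLM mask reconstruction
                    (assumed convention; negative means the mutant is less likely)
    weights, bias : regression parameters (hypothetical here, not fitted)
    """
    features = (conservation, blosum62, mask_log_odds)
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for illustration only; they are not taken from VESPA.
WEIGHTS = (2.0, -0.3, -0.5)
BIAS = -1.0

# A conservative substitution at a variable position...
conservative = sav_effect_score(0.2, 2, -0.1, WEIGHTS, BIAS)
# ...versus an unfavourable substitution at a conserved position.
disruptive = sav_effect_score(0.9, -3, -4.0, WEIGHTS, BIAS)

print(f"conservative: {conservative:.2f}, disruptive: {disruptive:.2f}")
```

With these illustrative weights, the variant at the conserved position receives the higher score, which is the qualitative behaviour expected of such an ensemble; a real model would learn its parameters from labelled training data.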
