Protein Language Model Fitness Is a Matter of Preference

Cade Gordon, Amy X. Lu, Pieter Abbeel

DOI: https://doi.org/10.1101/2024.10.03.616542

2024-10-03

Abstract: Leveraging billions of years of evolution, scientists have trained protein language models (pLMs) to understand the sequence and structure space of proteins aiding in the design of more functional proteins. Although they have shown ability to improve efficiency in engineering, it remains unclear if such models capture true biological patterns or artifacts of the training data. We aim to predict the circumstances in which pLMs can successfully perform zero-shot fitness estimation. Our work studies trends observed over hundreds of deep mutational scans across multiple different fitness objectives. We find that the likelihood, or abstractly, implicit preference of a certain protein sequence imbued during pretraining is predictive of fitness prediction capabilities. Both over-preferred and under-preferred wild type sequences harm performance. Using influence functions to causally understand how individual data points increase protein likelihoods, we find that there exists a power law tail due to sequence homology. Lastly, under-performance on low likelihood wild type proteins can be remedied by unsupervised finetuning. These findings that pLM zero-shot fitness estimation can be predicted by the likelihood of the engineered sequence can motivate and improve pLMs' deployment in protein maturation campaigns.

Bioinformatics

What problem does this paper attempt to address?

The paper examines when protein language models (pLMs) succeed or fail at zero-shot prediction of protein fitness. Specifically, the researchers focus on the following issues:

1. **Zero-shot prediction capability**: The researchers ask under what circumstances a pLM can estimate fitness zero-shot, that is, without any task-specific training on fitness labels.
2. **Relationship between preference and performance**: The researchers explore how the likelihood a pLM assigns to a protein sequence (its implicit "preference", acquired during pre-training) relates to how well the model predicts fitness for that protein. They find that this likelihood is predictive of fitness prediction performance.
3. **Impact of high-preference and low-preference sequences**: The study shows that both excessively high and excessively low preferences impair performance. The model performs poorly on low-likelihood (under-preferred) wild-type sequences, and its performance also degrades on over-preferred, very high-likelihood wild-type sequences.
4. **Causal relationship analysis**: Using influence functions, the researchers examine how individual training data points increase the likelihood of a protein sequence. They find that the distribution of influence has a power-law tail, which they attribute to sequence homology.
5. **Improvement methods**: Based on these findings, the researchers apply unsupervised fine-tuning ("evo-tuning") to low-likelihood wild-type sequences to improve fitness prediction for them, while leaving high-likelihood sequences unchanged.

In summary, the paper aims to improve the practical use of pLMs in protein engineering by analyzing how the preferences a pLM acquires during pre-training shape its zero-shot fitness predictions.
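To make the two quantities in this summary concrete, here is a minimal Python sketch of how a wild-type likelihood and a zero-shot mutation score are commonly computed from per-position log-probabilities. It is an illustration under stated assumptions, not the authors' implementation: `toy_log_probs` is a hypothetical stand-in for a real pLM, the example sequence is made up, and the log-odds score is the masked-marginal style of scoring often used for zero-shot variant effect prediction rather than something confirmed from this paper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def toy_log_probs(sequence, wt_logit=2.0):
    """Hypothetical stand-in for a pLM (assumption, not a real model).

    Returns one row of 20 log-probabilities per position, giving the
    observed residue a higher logit than every other residue.
    """
    rows = []
    for aa in sequence:
        logits = [wt_logit if a == aa else 0.0 for a in AMINO_ACIDS]
        rows.append(log_softmax(logits))
    return rows


def pseudo_log_likelihood(sequence, log_probs):
    """Sum the log-probability of the observed residue at each position."""
    return sum(
        log_probs[i][AMINO_ACIDS.index(aa)] for i, aa in enumerate(sequence)
    )


def mutation_score(wild_type, position, mutant_aa, log_probs):
    """Log-odds of the mutant residue versus the wild-type residue.

    `position` is 0-indexed. Negative scores mean the model prefers
    the wild-type residue at that position.
    """
    row = log_probs[position]
    wt_index = AMINO_ACIDS.index(wild_type[position])
    mut_index = AMINO_ACIDS.index(mutant_aa)
    return row[mut_index] - row[wt_index]


wild_type = "MKTAYIAK"  # made-up example sequence
log_probs = toy_log_probs(wild_type)

# How strongly the (toy) model "prefers" the wild type as a whole.
wt_pll = pseudo_log_likelihood(wild_type, log_probs)

# Zero-shot score for substituting position 2 (T) with S.
score = mutation_score(wild_type, 2, "S", log_probs)

print(f"wild-type pseudo log-likelihood: {wt_pll:.3f}")
print(f"T3S mutation score: {score:.3f}")
```

With a real model, `log_probs` would come from the model's output distribution at each position. The wild-type pseudo log-likelihood is then the quantity that, per the summary above, predicts how trustworthy the per-mutation scores are likely to be.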

Likelihood-based fine-tuning of protein language models for few-shot fitness prediction and design

Protein Language Models: Is Scaling Necessary?

Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods

Contrastive Fitness Learning: Reprogramming Protein Language Models for Low-N Learning of Protein Fitness Landscape

Protein language models are biased by unequal sequence sampling across the tree of life

Structure-informed protein language models are robust predictors for variant effects

Metalic: Meta-Learning In-Context with Protein Language Models

Active Finetuning Protein Language Model: A Budget-Friendly Method for Directed Evolution

Knowledge-aware Reinforced Language Models for Protein Directed Evolution

Fine-tuning protein language models boosts predictions across diverse tasks

Learning protein fitness models from evolutionary and assay-labeled data

Zero-shot transfer of protein sequence likelihood models to thermostability prediction

Machine learning to predict continuous protein properties from binary cell sorting data and map unseen sequence space

Efficient Inference, Training, and Fine-tuning of Protein Language Models

Protein language models learn evolutionary statistics of interacting sequence motifs

Protein Language Models in Directed Evolution

Learning protein fitness landscapes with deep mutational scanning data from multiple sources

Protein language models meet reduced amino acid alphabets

Language models enable zero-shot prediction of the effects of mutations on protein function

Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction