Protein Language Model Fitness Is a Matter of Preference

Cade Gordon, Amy X. Lu, Pieter Abbeel
DOI: https://doi.org/10.1101/2024.10.03.616542
2024-10-03
Abstract: Leveraging billions of years of evolution, scientists have trained protein language models (pLMs) to understand the sequence and structure space of proteins aiding in the design of more functional proteins. Although they have shown ability to improve efficiency in engineering, it remains unclear if such models capture true biological patterns or artifacts of the training data. We aim to predict the circumstances in which pLMs can successfully perform zero-shot fitness estimation. Our work studies trends observed over hundreds of deep mutational scans across multiple different fitness objectives. We find that the likelihood, or abstractly, implicit preference of a certain protein sequence imbued during pretraining is predictive of fitness prediction capabilities. Both over-preferred and under-preferred wild type sequences harm performance. Using influence functions to causally understand how individual data points increase protein likelihoods, we find that there exists a power law tail due to sequence homology. Lastly, under-performance on low likelihood wild type proteins can be remedied by unsupervised finetuning. These findings that pLM zero-shot fitness estimation can be predicted by the likelihood of the engineered sequence can motivate and improve pLMs' deployment in protein maturation campaigns.
Bioinformatics
What problem does this paper attempt to address?
The paper examines when protein language models (pLMs) succeed or fail at zero-shot prediction of protein fitness. Specifically, the researchers focus on the following issues:

1. **Zero-shot prediction capability**: The researchers aim to predict the circumstances under which a pLM can estimate fitness zero-shot, that is, without any task-specific training on fitness measurements. The analysis draws on hundreds of deep mutational scans covering several fitness objectives.
2. **Relationship between preference and performance**: The likelihood a pLM assigns to a protein sequence acts as an implicit preference acquired during pre-training. The researchers find that this likelihood is predictive of how well the model estimates fitness for that protein.
3. **Impact of high-preference and low-preference sequences**: Both extremes impair performance. The model performs poorly when the wild-type sequence has a low likelihood (under-preferred), and also when it has an excessively high likelihood (over-preferred).
4. **Causal relationship analysis**: Using influence functions, the researchers trace how individual training data points increase the likelihood of a protein sequence. The resulting influence distribution has a power-law tail, which they attribute to sequence homology.
5. **Improvement methods**: Based on these findings, unsupervised fine-tuning, called "evo-tuning," is applied for low-likelihood wild-type proteins to improve fitness prediction on them, while the model is left unchanged for high-likelihood wild types.

In summary, the paper shows that the likelihood of the engineered sequence predicts how well a pLM performs zero-shot fitness estimation, a finding intended to guide and improve the use of pLMs in protein engineering.
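To make the quantities in this summary concrete, the following is a minimal sketch of how sequence likelihood and a zero-shot mutation score might be computed. It is not the paper's code. The log-ratio score (log-probability of the mutant residue minus that of the wild-type residue) is a commonly used zero-shot heuristic for pLMs, and whether the paper uses exactly this form is an assumption. The helper `toy_position_probs`, the three-residue sequence, and the "measured" fitness values are all hypothetical stand-ins for real model output and real deep mutational scan data.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_position_probs(preferred, peak=0.62):
    """Hypothetical stand-in for pLM output: one distribution per position,
    with `peak` mass on a preferred residue and the rest spread evenly."""
    rest = (1.0 - peak) / (len(AMINO_ACIDS) - 1)
    return [{aa: (peak if aa == p else rest) for aa in AMINO_ACIDS}
            for p in preferred]


def sequence_log_likelihood(seq, position_probs):
    """Sum of per-position log-probabilities of the residues in `seq`."""
    return sum(math.log(position_probs[i][aa]) for i, aa in enumerate(seq))


def mutation_score(wild_type, pos, mutant_aa, position_probs):
    """Zero-shot score: log p(mutant) - log p(wild type) at one position."""
    probs = position_probs[pos]
    return math.log(probs[mutant_aa]) - math.log(probs[wild_type[pos]])


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


wild_type = "MKV"                       # hypothetical wild-type sequence
probs = toy_position_probs("MKL")       # toy model prefers L at the last position
mutations = [(0, "A"), (1, "R"), (2, "L")]
scores = [mutation_score(wild_type, p, aa, probs) for p, aa in mutations]
measured = [0.1, 0.3, 0.9]              # made-up fitness values, illustration only

print("wild-type log-likelihood:", sequence_log_likelihood(wild_type, probs))
print("zero-shot scores:", scores)
print("Spearman correlation:", spearman(scores, measured))
```

In this framing, the wild-type log-likelihood is the "preference" the summary refers to, and the Spearman correlation between zero-shot scores and measured fitness is one plausible way to quantify prediction quality on a deep mutational scan.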