Designing diverse and high-performance proteins with a large language model in the loop
Carlos Gomez-Uribe, Japheth Gado, Meiirbek Islamov
DOI: https://doi.org/10.1101/2024.10.25.620340
2024-10-29
Abstract: We present a novel protein engineering approach to machine-learning-guided directed evolution that integrates a new semi-supervised neural network fitness prediction model, Seq2Fitness, with a new optimization algorithm for sequence design, biphasic annealing for diverse adaptive sequence sampling (BADASS). Seq2Fitness leverages protein language models to predict fitness landscapes, combining evolutionary data with experimental labels. BADASS explores these landscapes efficiently by dynamically adjusting temperature and mutation energies, which prevents premature convergence and yields diverse high-fitness sequences. Seq2Fitness predictions improve the Spearman correlation with fitness measurements over alternative model predictions, e.g., from 0.34 to 0.55 for sequences with mutations at residues that are absent from the training set. BADASS requires less memory and computation than gradient-based Markov Chain Monte Carlo methods, while finding more high-fitness sequences and maintaining sequence diversity in protein design tasks for two different protein families with hundreds of amino acids. For example, for both protein families, 100% of the top 10,000 sequences found by BADASS have higher Seq2Fitness predictions than the wildtype sequence, versus a range of 3% to 99% for competing approaches, which often find far fewer than 10,000 sequences. The fitness predictions for the best, 100th-best, and 1,000th-best sequences found by BADASS are also all higher than for competing approaches. In addition, we develop a theoretical framework to explain where BADASS comes from, why it works, and how it behaves. Although we evaluate BADASS here only on amino acid sequences, it may be more broadly useful for exploring other sequence spaces, including DNA and RNA. To ensure reproducibility and facilitate adoption, our code is publicly available at https://github.com/SoluLearn/BADASS/.
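The two evaluation quantities quoted in the abstract, the Spearman correlation between predicted and measured fitness and the share of top-ranked designs that score above the wildtype, can be illustrated with a minimal sketch. This is not the authors' code: the function names (`average_ranks`, `spearman`, `fraction_above_wildtype`) and the toy numbers are hypothetical and serve only to show how such metrics are typically computed.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def fraction_above_wildtype(scores, wildtype_score, top_k):
    """Share of the top_k highest-scoring designs that beat the wildtype."""
    top = sorted(scores, reverse=True)[:top_k]
    return sum(s > wildtype_score for s in top) / len(top)


# Toy data (made up for illustration, not taken from the paper).
measured = [0.1, 0.4, 0.35, 0.8, 0.6]
predicted = [0.2, 0.5, 0.30, 0.9, 0.4]
rho = spearman(measured, predicted)
share = fraction_above_wildtype(predicted, wildtype_score=0.35, top_k=3)
```

Because Spearman correlation depends only on ranks, a model is rewarded for ordering variants correctly even when its predicted fitness values are on a different scale from the measurements.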
Bioengineering