Protein Design by Directed Evolution Guided by Large Language Models

Trong Thanh Tran,Truong Son Hy

DOI: https://doi.org/10.1101/2023.11.28.568945

2024-08-18

Abstract: Directed evolution, a strategy for protein engineering, optimizes protein properties (i.e., fitness) by a rigorous and resource-intensive process of screening or selecting among a vast range of mutations. By conducting an in silico screening of sequence properties, machine learning-guided directed evolution (MLDE) can expedite the optimization process and alleviate the experimental workload. In this work, we propose a general MLDE framework in which we apply recent advancements of Deep Learning in protein representation learning and protein property prediction to accelerate the searching and optimization processes. In particular, we introduce an optimization pipeline that utilizes Large Language Models (LLMs) to pinpoint the mutation hotspots in the sequence and then suggest replacements to improve the overall fitness. Our experiments have shown the superior efficiency and efficacy of our proposed framework in conditional protein generation, in comparison with other state-of-the-art baseline algorithms. We expect this work will shed new light not only on protein engineering but also on solving combinatorial problems using data-driven methods. Our implementation is publicly available at https://github.com/HySonLab/Directed_Evolution

Bioinformatics

What problem does this paper attempt to address?

The paper aims to accelerate directed evolution in protein engineering by combining large language models (LLMs) with machine-learning techniques, with the goal of improving protein performance (i.e., fitness). Specifically, it proposes a machine-learning-guided directed evolution (MLDE) framework that applies recent deep-learning advances in protein representation learning and protein property prediction to the search and optimization process. The framework reduces experimental workload and speeds up optimization by identifying mutation hotspots in the sequence and suggesting replacements that improve overall fitness.

The framework consists of the following steps:

1. **Selecting mutation positions**: Two strategies determine which positions to mutate. Random masking selects positions at random, while importance masking selects positions based on the importance of k-mers.
2. **Generating mutant sequences**: The partially masked sequence is passed to the pre-trained ESM-2 model, which predicts the masked amino acids and thereby produces new mutant sequences.
3. **Predicting fitness**: A fine-tuned Attention1D model predicts the fitness values of the newly generated sequences.
4. **Selecting the optimal variant**: The sequence with the highest predicted fitness becomes the candidate for the next generation. The process repeats until a predetermined number of iterations is reached or a stopping condition is met.

The paper reports that this framework is efficient and effective on the conditional protein generation task and outperforms the state-of-the-art baseline algorithms it is compared against. Beyond protein engineering, the work also suggests a data-driven approach to combinatorial optimization problems.
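The mask, generate, score, and select loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_fill_masks` is a hypothetical stand-in for the ESM-2 masked-token prediction, `toy_fitness` is a hypothetical stand-in for the fine-tuned Attention1D predictor, only the random masking strategy is shown, and keeping the parent sequence among the candidates is an assumption made here so that fitness never decreases.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "_"  # placeholder mask symbol (assumption, not the ESM-2 token)


def random_mask(seq, n_mask, rng):
    """Step 1 (random masking): mask n_mask positions chosen uniformly."""
    positions = rng.sample(range(len(seq)), n_mask)
    masked = list(seq)
    for p in positions:
        masked[p] = MASK
    return "".join(masked)


def toy_fill_masks(masked, rng):
    """Step 2 stand-in: fill each mask with a random amino acid.

    A real pipeline would query a pretrained masked language model here.
    """
    return "".join(rng.choice(AMINO_ACIDS) if c == MASK else c for c in masked)


def toy_fitness(seq):
    """Step 3 stand-in: a made-up score (fraction of alanine residues)."""
    return seq.count("A") / len(seq)


def directed_evolution(wild_type, n_steps=20, population=32, n_mask=2, seed=0):
    """Step 4: keep the best-scoring sequence as the next generation's parent."""
    rng = random.Random(seed)
    best = wild_type
    for _ in range(n_steps):
        candidates = [
            toy_fill_masks(random_mask(best, n_mask, rng), rng)
            for _ in range(population)
        ]
        # Assumption: the parent competes with its mutants (elitism).
        best = max(candidates + [best], key=toy_fitness)
    return best, toy_fitness(best)


if __name__ == "__main__":
    start = "MKTLLVGSWQ"
    final, score = directed_evolution(start)
    print(start, toy_fitness(start))
    print(final, score)
```

Replacing the two stand-in functions with calls to a masked protein language model and a trained fitness regressor would turn the sketch into the kind of in silico screening loop the summary describes, while the surrounding control flow stays the same.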

Active Finetuning Protein Language Model: A Budget-Friendly Method for Directed Evolution

Knowledge-aware Reinforced Language Models for Protein Directed Evolution

Latent-based Directed Evolution accelerated by Gradient Ascent for Protein Sequence Design

Machine learning-guided directed evolution for protein engineering

Protein Language Models in Directed Evolution

Machine-learning-guided directed evolution for protein engineering

Evolutionary context-integrated deep sequence modeling for protein engineering

Augmentation of Structure Information to the Sequence-Based Machine Learning-Assisted Directed Protein Evolution

Active Learning-Assisted Directed Evolution

In Vitro Continuous Protein Evolution Empowered by Machine Learning and Automation.

Machine learning-assisted directed protein evolution with combinatorial libraries

Machine Learning-Assisted Directed Evolution Navigates a Combinatorial Epistatic Fitness Landscape with Minimal Screening Burden

ODBO: Bayesian Optimization with Search Space Prescreening for Directed Protein Evolution

Cluster learning-assisted directed evolution

Reinforcement Learning for Sequence Design Leveraging Protein Language Models

Rapid protein evolution by few-shot learning with a protein language model

Machine Learning for Protein Engineering

Mathematics-assisted directed evolution and protein engineering

Evaluation of Machine Learning-Assisted Directed Evolution Across Diverse Combinatorial Landscapes