Augmentation of Structure Information to the Sequence-Based Machine Learning-Assisted Directed Protein Evolution

Jangwook Jung, Lane Yutzy, Joohyun Kim, Kenny Nguyen, Peter Vallet, Jielin Yu, Ronggui He, Jianxiong Li, Le Yan
DOI: https://doi.org/10.26434/chemrxiv-2023-llpnk-v3
2024-03-05
Abstract: Directed evolution (DE) mimics natural selection to improve the functions of a target protein. Machine learning (ML) has significantly streamlined DE by aiding several steps, including identifying starting variants, generating diverse libraries, and modeling sequence-fitness relationships. To date, most ML-assisted DE (MLDE) approaches have relied predominantly on sequence information because of the challenges and cost of obtaining protein structure information. Here, we introduce a structure-augmented MLDE (saMLDE) approach for selecting high-fitness variants from a library of the Protein G B1 domain. We adopted a zero-shot sequence-based prediction method, which offers the potential to discover new insights without extensive training data, to select an initial training library of 96 variants for the saMLDE campaign. To leverage protein structure information, we used protein structure prediction with AlphaFold2 and molecular docking simulations performed with Rosetta FlexPepDock, yielding structure-based features derived with an induced-fit model. After three rounds of the saMLDE campaign, we demonstrated that incorporating structural information gradually improves the average fitness scores and the precision of predicted binders. In addition, we found that the choice of zero-shot subset selection method for the initial library significantly affected the average fitness scores and precision, and consequently the overall directed evolution trajectories.
Chemistry
What problem does this paper attempt to address?
This paper proposes Structure-Augmented Machine Learning-assisted Directed Evolution (saMLDE), a method intended to overcome the reliance of conventional Machine Learning-assisted Directed Evolution (MLDE) on sequence information alone. Directed evolution optimizes protein function by mimicking natural selection. Machine learning has accelerated this process, but the difficulty and cost of obtaining protein structure information have kept most MLDE approaches sequence-based.

saMLDE combines protein structures predicted by AlphaFold2 with molecular docking simulations run in Rosetta FlexPepDock to generate structure-based features. The researchers used a zero-shot prediction method to select the initial training library of 96 Protein G B1 domain variants, then ran three rounds of the campaign, over which the average fitness score of the selected variants and the precision of binder predictions gradually improved. The results indicate that adding structure information improves the performance of directed evolution, and that the choice of initial library has a significant impact on the evolutionary trajectory.

The paper also discusses how to balance exploration and exploitation and how to deal with the complexity of protein fitness landscapes, such as phenotype-dependent interactions. According to the summary, structure information strengthens the model's predictive ability even with limited data, and saMLDE is more effective than sequence-only methods at reducing the number of low-fitness variants selected. Overall, the paper addresses how to use protein structure information to improve machine learning-assisted directed evolution and to discover protein variants with desired functions more efficiently.
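The round-based workflow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the ridge regressor, the one-hot sequence encoding, and every function name below are assumptions chosen for illustration, and `struct_feats` is a hypothetical placeholder for whatever per-variant features are derived from the AlphaFold2 and Rosetta FlexPepDock outputs. The sketch shows only the general idea of concatenating sequence and structure features, fitting a model on measured variants, selecting the top-ranked unmeasured variants, and scoring a selection by precision.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Flatten a sequence into a one-hot vector (20 entries per position)."""
    vec = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        vec[i, AMINO_ACIDS.index(aa)] = 1.0
    return vec.ravel()


def build_features(seqs, struct_feats):
    """Concatenate sequence features with per-variant structure features.

    struct_feats is a hypothetical stand-in for docking-derived scores.
    """
    seq_part = np.array([one_hot(s) for s in seqs])
    return np.hstack([seq_part, np.asarray(struct_feats, dtype=float)])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression (an assumed model); returns weights."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


def select_top(X_pool, weights, k):
    """Indices of the k pool variants with the highest predicted fitness."""
    return np.argsort(X_pool @ weights)[::-1][:k]


def run_round(X, fitness, train_idx, k):
    """One round: fit on measured variants, pick k unmeasured ones.

    Only fitness[train_idx] is read, mirroring the fact that unmeasured
    variants have no assay value yet.
    """
    w = fit_ridge(X[train_idx], fitness[train_idx])
    pool = np.setdiff1d(np.arange(len(X)), train_idx)
    return pool[select_top(X[pool], w, k)]


def precision(measured, threshold):
    """Fraction of selected variants whose measured fitness meets a threshold."""
    measured = np.asarray(measured, dtype=float)
    return float(np.mean(measured >= threshold))
```

In a campaign of this kind, the variants returned by `run_round` would be assayed, appended to `train_idx`, and the round repeated; the binder threshold passed to `precision` is likewise an assumption that depends on the assay.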