Abstract: Synthetic biology is a fast-evolving research field that combines biology and engineering principles to develop new biological systems for medical, pharmacological, and industrial applications. Synthetic biologists use iterative "design, build, test, and learn" cycles to efficiently engineer genetic systems that are reliable, reproducible, and predictable. Protein engineering by directed evolution can benefit from such a systematic engineering approach for various reasons. Learning can be carried out before starting, throughout, or after finalizing a directed evolution project. Computational tools, bioinformatics, and scanning mutagenesis methods can be excellent starting points, while molecular dynamics simulations and other strategies can guide engineering efforts. Similarly, studying protein intermediates along evolutionary pathways offers fascinating insights into the molecular mechanisms shaped by evolution. The learning step of the cycle is crucial for proteins or enzymes that are not suitable for high-throughput screening or selection systems, and it is also valuable for any platform that generates large amounts of data whose analysis can be aided by machine learning algorithms. The main challenge in protein engineering is to predict the effect of a single mutation on one functional parameter, to say nothing of several mutations on multiple parameters. This is largely due to nonadditive mutational interactions, known as epistatic effects: a mutation that is beneficial in one genetic background may not be beneficial in another. In this work, we provide an overview of experimental and computational strategies that can guide the user in learning about protein function at different stages of a directed evolution project. We also discuss how epistatic effects can influence the success of directed evolution projects. Since machine learning is gaining momentum in protein engineering and the field is becoming more interdisciplinary thanks to collaboration between mathematicians, computational scientists, engineers, molecular biologists, and chemists, we provide a general workflow that familiarizes nonexperts with the basic concepts, dataset requirements, learning approaches, model capabilities, and performance metrics of this intriguing area. Finally, we provide some practical recommendations on how machine learning can harness epistatic effects for engineering proteins in an "outside-the-box" way.
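To make the notion of nonadditive mutational interactions concrete, the following minimal sketch computes pairwise epistasis for two mutations, A and B, as the deviation of the double mutant from the additive expectation, and checks for sign epistasis (a mutation whose effect changes sign depending on the background). It assumes fitness values are expressed on a scale where independent effects would add (for example, log-transformed activity). The function names and the numerical values are hypothetical and chosen only for illustration; they are not data from this work.

```python
def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of the double mutant from the additive expectation.

    Assumes fitness is on an additive scale (e.g. log activity).
    Zero means the two mutations combine additively.
    """
    return f_ab - f_a - f_b + f_wt


def has_sign_epistasis(f_wt, f_a, f_b, f_ab):
    """True if either mutation changes the sign of its effect
    depending on whether the other mutation is present."""
    effect_a_in_wt = f_a - f_wt   # effect of A in the parent background
    effect_a_in_b = f_ab - f_b    # effect of A when B is already present
    effect_b_in_wt = f_b - f_wt   # effect of B in the parent background
    effect_b_in_a = f_ab - f_a    # effect of B when A is already present
    return (effect_a_in_wt * effect_a_in_b < 0
            or effect_b_in_wt * effect_b_in_a < 0)


# Hypothetical values: both single mutants improve on the parent,
# but the double mutant is worse than either single mutant.
epsilon = pairwise_epistasis(f_wt=0.0, f_a=1.0, f_b=0.5, f_ab=0.2)
sign_flip = has_sign_epistasis(f_wt=0.0, f_a=1.0, f_b=0.5, f_ab=0.2)
print(f"epistasis = {epsilon:.2f}, sign epistasis = {sign_flip}")
```

In this illustrative case the epistasis term is negative and each mutation is beneficial in the parent but deleterious once the other is present, which is the situation in which combining individually beneficial mutations fails.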
