Engineering of highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening

Neil Thomas, David Belanger, Chenling Xu, Hanson Lee, Kat Hirano, Kosuke Iwai, Vanja Polic, Kendra D Nyberg, Kevin G Hoff, Lucas Frenz, Charlie A Emrich, Jun W Kim, Mariya Chavarha, Abi Ramanan, Jeremy J Agresti, Lucy J Colwell
DOI: https://doi.org/10.1101/2024.03.21.585615
2024-06-05
Abstract: Optimizing enzymes to function in novel chemical environments is a central goal of synthetic biology, but optimization is often hindered by a rugged, expansive protein search space and costly experiments. In this work, we present TeleProt, an ML framework that blends evolutionary and experimental data to design diverse protein variant libraries, and employ it to improve the catalytic activity of a nuclease enzyme that degrades biofilms that accumulate on chronic wounds. After multiple rounds of high-throughput experiments using both TeleProt and standard directed evolution (DE) approaches in parallel, we find that our approach found a significantly better top-performing enzyme variant than DE, had a better hit rate at finding diverse, high-activity variants, and was even able to design a high-performance initial library using no prior experimental data. We have released a dataset of 55K nuclease variants, one of the most extensive genotype-phenotype enzyme activity landscapes to date, to drive further progress in ML-guided design.
Bioinformatics
What problem does this paper attempt to address?
The main problem this paper addresses is how to enhance the catalytic activity of the nuclease NucB at neutral pH (pH 7) so that it is effective in applications such as chronic wound care and biofilm removal. Specifically, the research team aims to design more active and more diverse nuclease variants by combining machine learning (ML) with ultra-high-throughput screening, thereby overcoming the limitations of traditional directed evolution in protein optimization.

### Main Problems:

1. **Enhancing NucB Catalytic Activity at pH 7**:
   - Wild-type NucB is highly active mainly in alkaline environments (e.g., pH 9); its activity drops significantly at physiological pH (pH 7), which limits its use in areas such as chronic wound care.
   - The goal is not only to restore NucB activity at pH 7 but to raise it above the wild-type activity at pH 9.
2. **Optimizing Protein Engineering Methods**:
   - Traditional directed evolution (DE) is limited by a vast search space, high experimental costs, and a tendency to get trapped in local optima.
   - By introducing machine learning-guided directed evolution (MLDE), the research team hopes to find a more efficient way to design and screen high-activity protein variants.

### Solutions:

- **TeleProt Framework**: Design a diverse protein variant library by combining evolutionary data and experimental data.
- **Ultra-High-Throughput Screening**: Use a microfluidic platform for large-scale screening to quickly assess the activity of numerous protein variants.
- **Multiple Rounds of Optimization**: Gradually optimize protein activity through multiple rounds of experiments and model iterations.
- **Zero-Shot Design**: Generate an initial variant library from the conserved patterns of natural homologous sequences, without any experimental data.
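To make the zero-shot idea above concrete, here is a minimal Python sketch of scoring a variant from homolog conservation alone. It is an illustrative stand-in and not the TeleProt model: it assumes a simple site-independent amino-acid profile built from a pre-aligned set of homologs, and the function names (`site_log_probs`, `zero_shot_score`), the pseudocount value, and the four-residue toy sequences are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_log_probs(alignment, pseudocount=1.0):
    """Per-position log-probabilities of each amino acid.

    `alignment` is a list of equal-length, pre-aligned homolog sequences.
    A pseudocount keeps unseen residues from getting probability zero.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for pos in range(length):
        # Count only standard residues; gaps and unknowns are ignored.
        counts = Counter(seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: math.log((counts[aa] + pseudocount) / total) for aa in AMINO_ACIDS}
        )
    return profile


def zero_shot_score(variant, wild_type, profile):
    """Sum of log-odds (variant residue vs. wild-type residue) over mutated sites.

    Higher scores mean the mutations look more like what natural homologs carry.
    """
    if len(variant) != len(wild_type):
        raise ValueError("variant and wild type must have the same length")
    return sum(
        profile[i][v] - profile[i][w]
        for i, (v, w) in enumerate(zip(variant, wild_type))
        if v != w
    )


# Toy data (made up for illustration, not NucB sequences).
homologs = ["MKTA", "MKSA", "MRTA", "MKTA"]
wild_type = "MKTA"
profile = site_log_probs(homologs)
candidates = ["MRTA", "MWTA", "MKSA"]
ranked = sorted(
    candidates, key=lambda s: zero_shot_score(s, wild_type, profile), reverse=True
)
```

In this toy ranking, substitutions that occur among the homologs (K to R, T to S) score higher than one that never occurs (K to W). A real pipeline would likely use a richer sequence model and an explicit diversity constraint when picking the library.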
### Experimental Results:

- **MLDE Outperforms DE**: The MLDE method significantly outperforms traditional DE methods in discovering high-activity variants, especially in terms of diversity.
- **Discovery of High-Activity Variants**: Using the MLDE method, the research team successfully discovered multiple NucB variants with significantly enhanced activity, with the best variant showing a 19-fold increase in activity.
- **Biofilm Degradation Ability**: The best variant demonstrated significantly better biofilm degradation ability at neutral pH compared to wild-type NucB.

### Dataset Release:

- The research team released a genotype-phenotype dataset containing 55,760 NucB variants, which is one of the most comprehensive enzyme activity landscape datasets to date, providing valuable resources for future ML-guided design.

Through these methods, the research team not only addressed the issue of NucB activity at pH 7 but also provided new tools and methods for the field of protein engineering.
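The comparison of MLDE and DE above rests on two quantities: how often a library yields a high-activity variant, and how different those variants are from one another. The summary does not define either metric, so the sketch below uses assumed definitions: hit rate as the fraction of variants whose activity exceeds a threshold, and diversity as the mean pairwise Hamming distance among hits. The helper names and the toy activity values are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hit_rate(activities, threshold):
    """Fraction of variants whose measured activity exceeds `threshold`.

    `activities` maps sequence -> activity (e.g. fold change over wild type).
    """
    if not activities:
        return 0.0
    return sum(a > threshold for a in activities.values()) / len(activities)


def mean_pairwise_distance(sequences):
    """Mean Hamming distance over all unordered pairs (0.0 if fewer than two)."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Toy library (made-up sequences and activities, not data from the paper).
library = {"MKTA": 1.0, "MRTA": 2.5, "MKSA": 0.4, "MRSV": 3.0}
threshold = 1.0  # assumed cutoff: activity above wild type
hits = [seq for seq, act in library.items() if act > threshold]
rate = hit_rate(library, threshold)
diversity = mean_pairwise_distance(hits)
```

Reporting both numbers together matters because a library can reach a high hit rate by proposing near-duplicates of one good variant; the distance term separates that case from a library whose hits are spread across sequence space.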