Generation of connections between protein sequence space and chemical space to enable a predictive model for biocatalysis

Alison Narayan, Gabe Gomes, Alexandra Paton, Daniil Boiko, Jonathan Perkins, Nicholas Cemalovic, Thiago Reschützegger
DOI: https://doi.org/10.26434/chemrxiv-2024-w4dtr
2024-10-15
Abstract: The application of biocatalysis in synthesis has the potential to offer dramatically streamlined routes toward target molecules, exquisite and tunable catalyst-controlled selectivity, as well as more sustainable processes. Despite these advantages, biocatalytic synthetic strategies can be high risk to implement. Successful execution of these approaches requires identifying an enzyme capable of performing chemistry on a specific intermediate in a synthesis, which often calls for extensive screening of enzymes and protein engineering. Strategies for predicting which enzyme is most likely to be compatible with a given small molecule have been hindered by the lack of well-studied biocatalytic reactions. The underexploration of connections between chemical and protein sequence spaces constrains navigation between these two landscapes. Herein, this longstanding challenge is overcome in a two-phase effort relying on high-throughput experimentation to populate connections between substrate chemical space and biocatalyst sequence space, and the subsequent development of machine learning models that enable navigation between these two landscapes. Using a curated library of α-ketoglutarate-dependent non-heme iron (NHI) enzymes, the BioCatSet1 dataset was generated to capture the reactivity of each biocatalyst with >100 substrates. In addition to the discovery of novel chemistry, BioCatSet1 was leveraged to develop a predictive workflow that provides a ranked list of enzymes that have the greatest compatibility with a given substrate. To make this tool accessible to the community, we built CATNIP, an open-access web interface to our predictive workflows. We anticipate our approach can be readily expanded to additional enzyme and transformation classes, and will derisk the application of biocatalysis in chemical synthesis.
Chemistry
What problem does this paper attempt to address?
The paper addresses the difficulty of predicting which enzymes are compatible with a specific substrate in biocatalysis. Specifically, it aims to build a predictive model that identifies the enzymes most likely to be compatible with a given small molecule, thereby lowering the barrier to applying biocatalysis in synthesis. The challenge stems mainly from the scarcity of well-studied biocatalytic reactions and from the underexplored connections between chemical space and protein sequence space. To overcome these obstacles, the authors propose a two-phase approach:

1. **High-throughput experimentation**: Generate data through high-throughput experiments to populate connections between substrate chemical space and biocatalyst sequence space. The authors used a curated library of α-ketoglutarate-dependent non-heme iron (NHI) enzymes to generate the BioCatSet1 dataset, which records the reactivity of each biocatalyst with more than 100 substrates.
2. **Machine learning model development**: Use the generated dataset to develop machine learning models that enable navigation between chemical space and protein sequence space. For a given substrate, the resulting workflow returns a list of enzymes ranked by predicted compatibility.

Finally, the authors built CATNIP, an open-access web interface that makes the predictive workflows available to the community. Beyond the discovery of novel chemistry reported in the work, the authors anticipate that the approach can be expanded to additional enzyme and transformation classes and will derisk the application of biocatalysis in chemical synthesis.
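To make the idea of a "ranked list of enzymes for a given substrate" concrete, here is a minimal sketch of one simple way such a ranking could be computed from a reactivity table. This is not the authors' model or the CATNIP implementation: the similarity-weighted scoring, the function names `tanimoto` and `rank_enzymes`, the feature-set representation of substrates, and all data values are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_enzymes(
    query: FrozenSet[str],
    substrates: Dict[str, FrozenSet[str]],
    reactivity: Dict[str, Dict[str, float]],
) -> List[Tuple[str, float]]:
    """Rank enzymes for a query substrate.

    Each enzyme's score is the similarity-weighted average of its
    observed reactivity over the known substrates (an assumed,
    deliberately simple baseline, not the paper's method).
    """
    scores: Dict[str, float] = {}
    for enzyme, observed in reactivity.items():
        weighted_sum = 0.0
        weight_total = 0.0
        for substrate_name, activity in observed.items():
            weight = tanimoto(query, substrates[substrate_name])
            weighted_sum += weight * activity
            weight_total += weight
        scores[enzyme] = weighted_sum / weight_total if weight_total > 0 else 0.0
    # Highest predicted compatibility first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: substrates as sets of structural features,
# and a binary reactivity table (1.0 = active, 0.0 = inactive).
substrates = {
    "sub_A": frozenset({"aromatic", "phenol", "amine"}),
    "sub_B": frozenset({"aliphatic", "carboxylate", "amine"}),
}
reactivity = {
    "enzyme_1": {"sub_A": 1.0, "sub_B": 0.0},
    "enzyme_2": {"sub_A": 0.0, "sub_B": 1.0},
}

query = frozenset({"aromatic", "phenol"})
ranking = rank_enzymes(query, substrates, reactivity)
for enzyme, score in ranking:
    print(f"{enzyme}: {score:.2f}")
```

In this toy example the query resembles `sub_A` only, so `enzyme_1` ranks first. A realistic workflow would presumably use learned representations of both the substrate and the enzyme sequence rather than hand-written feature sets.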