Generation of connections between protein sequence space and chemical space to enable a predictive model for biocatalysis

Alison Narayan, Gabe Gomes, Alexandra Paton, Daniil Boiko, Jonathan Perkins, Nicholas Cemalovic, Thiago Reschützegger

DOI: https://doi.org/10.26434/chemrxiv-2024-w4dtr

2024-10-15

Abstract: The application of biocatalysis in synthesis has the potential to offer dramatically streamlined routes toward target molecules, exquisite and tunable catalyst-controlled selectivity, as well as more sustainable processes. Despite these advantages, biocatalytic synthetic strategies can be high risk to implement. Successful execution of these approaches requires identifying an enzyme capable of performing chemistry on a specific intermediate in a synthesis, which often calls for extensive screening of enzymes and protein engineering. Strategies for predicting which enzyme is most likely to be compatible with a given small molecule have been hindered by the lack of well-studied biocatalytic reactions. The underexploration of connections between chemical and protein sequence spaces constrains navigation between these two landscapes. Herein, this longstanding challenge is overcome in a two-phase effort relying on high-throughput experimentation to populate connections between substrate chemical space and biocatalyst sequence space, and the subsequent development of machine learning models that enable navigation between these two landscapes. Using a curated library of α-ketoglutarate-dependent non-heme iron (NHI) enzymes, the BioCatSet1 dataset was generated to capture the reactivity of each biocatalyst with >100 substrates. In addition to the discovery of novel chemistry, BioCatSet1 was leveraged to develop a predictive workflow that provides a ranked list of enzymes that have the greatest compatibility with a given substrate. To make this tool accessible to the community, we built CATNIP, an open-access web interface to our predictive workflows. We anticipate our approach can be readily expanded to additional enzyme and transformation classes, and will derisk the application of biocatalysis in chemical synthesis.

Chemistry

What problem does this paper attempt to address?

This paper addresses the difficulty of predicting which enzymes are compatible with a specific substrate in biocatalysis. The authors aim to build a predictive model that identifies the enzymes most likely to accept a given small molecule, making biocatalysis easier to apply in synthesis. The challenge stems from the small number of well-studied biocatalytic reactions and from the underexplored connections between chemical space and protein sequence space. To overcome these obstacles, the authors propose a two-stage method:

1. **High-throughput experiments**: Generate data through high-throughput experimentation to populate connections between substrate chemical space and biocatalyst sequence space. The authors used a curated library of α-ketoglutarate-dependent non-heme iron (NHI) enzymes to generate the BioCatSet1 dataset, which records the reactivity of each biocatalyst with more than 100 substrates.
2. **Machine-learning model development**: Use the generated dataset to develop machine-learning models that enable navigation between chemical space and protein sequence space. For a given substrate, the resulting workflow returns a list of enzymes ranked by predicted compatibility.

Finally, the authors built CATNIP, an open-access web interface that makes these predictive workflows available to researchers. According to the abstract, the work also led to the discovery of novel chemistry, and the authors anticipate that the approach can be extended to other enzyme and transformation classes and will derisk the application of biocatalysis in chemical synthesis.
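To make the idea of "a ranked list of enzymes for a given substrate" concrete, here is a minimal Python sketch. It is not the authors' model and not CATNIP's API, neither of which is described on this page. It swaps in a simple similarity-weighted nearest-neighbour baseline: each enzyme is scored by the activity it showed on already-tested substrates, weighted by how similar those substrates are to the query. All names (`tanimoto`, `rank_enzymes`, `enz_1`, `sub_A`, the feature labels) and all data values are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_enzymes(
    query: FrozenSet[str],
    substrate_features: Dict[str, FrozenSet[str]],
    reactivity: Dict[str, Dict[str, int]],
) -> List[Tuple[str, float]]:
    """Score each enzyme by similarity-weighted activity; best score first."""
    scores = []
    for enzyme, outcomes in reactivity.items():
        weighted_hits = 0.0
        total_weight = 0.0
        for substrate, active in outcomes.items():
            weight = tanimoto(query, substrate_features[substrate])
            weighted_hits += weight * active
            total_weight += weight
        # An enzyme with no similar tested substrate gets a neutral score of 0.
        score = weighted_hits / total_weight if total_weight > 0 else 0.0
        scores.append((enzyme, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical substrates described by coarse structural features
# (a real workflow would use a learned or fingerprint-based representation).
substrate_features = {
    "sub_A": frozenset({"aryl", "amine", "carboxylate"}),
    "sub_B": frozenset({"aryl", "hydroxyl"}),
    "sub_C": frozenset({"alkyl", "amine", "carboxylate"}),
}

# Hypothetical screening outcomes: 1 = product observed, 0 = no reaction.
reactivity = {
    "enz_1": {"sub_A": 1, "sub_B": 0, "sub_C": 1},
    "enz_2": {"sub_A": 0, "sub_B": 1, "sub_C": 0},
    "enz_3": {"sub_A": 1, "sub_B": 1, "sub_C": 0},
}

# A new, untested substrate for which we want candidate enzymes.
query = frozenset({"aryl", "amine"})
ranking = rank_enzymes(query, substrate_features, reactivity)

for enzyme, score in ranking:
    print(f"{enzyme}: {score:.3f}")
```

With this toy data the ranking comes out as `enz_3`, `enz_1`, `enz_2`. The sketch uses only substrate similarity; a model that also navigates protein sequence space, as the paper describes, would need an enzyme representation as well.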

Towards Genochemistry: Harnessing the Power of Biocatalysis for Research in the Life Sciences

Biocatalysed synthesis planning using data-driven learning

Exploration of bioinformatic domain based on data mining, reaction and enzyme promiscuity predictions

Machine learning modeling of family wide enzyme-substrate specificity screens

Microdroplet screening rapidly profiles a biocatalyst to enable its AI-assisted engineering

Enhanced Sequence-Activity Mapping and Evolution of Artificial Metalloenzymes by Active Learning

Combining chemistry and protein engineering for new-to-nature biocatalysis

ALDELE: All-Purpose Deep Learning Toolkits for Predicting the Biocatalytic Activities of Enzymes

Inferring Catalysis in Biological Systems

Evolutionary-Scale Enzymology Enables Biochemical Constant Prediction Across a Multi-Peaked Catalytic Landscape

Expanding chemistry through in vitro and in vivo biocatalysis

CatPred: A comprehensive framework for deep learning in vitro enzyme kinetic parameters

Data‐Driven Protein Engineering for Improving Catalytic Activity and Selectivity

Accelerating Biocatalysis Discovery with Machine Learning: A Paradigm Shift in Enzyme Engineering, Discovery, and Design

Biocatalysis: landmark discoveries and applications in chemical synthesis

Expanding the Boundaries of Biocatalysis

Biocatalytic strategy for the construction of sp3-rich polycyclic compounds from directed evolution and computational modelling

On synergy between ultrahigh throughput screening and machine learning in biocatalyst engineering

Strategies for designing biocatalysts with new functions