De novo design of triosephosphate isomerases using generative language models

Sergio Romero-Romero, Alexander E Braun, Timo Kossendey, Noelia Ferruz, Steffen Schmidt, Birte Höcker
DOI: https://doi.org/10.1101/2024.11.10.622869
2024-11-10
Abstract: The design of proteins with tailored functions is of immense interest to biotechnology, medicine, and the chemical industry. While protein design is rapidly evolving with the use of AI techniques, the design of complex enzymes remains a challenge. Here, we present the use of two large language models (LLMs), ZymCTRL and ProtGPT2, for the generation of de novo enzymes that catalyze the triosephosphate isomerase (TIM) reaction. Natural TIM enzymes are obligatory oligomers that catalyze a multi-step isomerization reaction near the diffusion limit. This makes TIM an ideal target to assess the generative ability of protein language models. Newly generated sequences were filtered to obtain a set of twelve candidates from each approach for experimental validation. Multiple constructs from both language models exhibit the intended function in vivo through their ability to complement a TIM-deficient E. coli strain. In-depth characterization of the best-behaving artificial enzyme reveals behavior and catalytic efficiency close to its natural counterparts. These findings support the use of conditional and fine-tuned unconditional LLMs for the generation of complex enzymes.
Biochemistry
What problem does this paper attempt to address?
This paper addresses the design of complex enzymes with a specific function, using triosephosphate isomerase (TIM) as the target. Protein design has advanced rapidly with AI techniques, but designing complex enzymes with high activity remains a challenge. The authors use two generative language models, ZymCTRL and ProtGPT2, to generate new protein sequences intended to catalyze the TIM reaction, and then test whether the generated proteins have the expected function. The main objectives are:

1. **Generating new TIM protein sequences**: Using ZymCTRL and ProtGPT2 to generate new protein sequences capable of catalyzing the TIM reaction.
2. **Experimental validation**: Verifying through in vivo and in vitro experiments whether the generated proteins have the expected catalytic function.
3. **Evaluating the capability of generative models**: Assessing how well generative models produce complex enzymes, particularly new proteins whose sequences differ significantly from natural TIM sequences.

Through these objectives, the paper aims to demonstrate the potential of generative language models for designing complex enzymes and to provide methods and ideas for further research and applications.