Integrating Genetic Algorithms and Language Models for Enhanced Enzyme Design

Yves Gaetan Nana Teukam, Federico Zipoli, Teodoro Laino, Emanuele Criscuolo, Francesca Grisoni, Matteo Manica
DOI: https://doi.org/10.26434/chemrxiv-2024-j7ntq
2024-03-13
Abstract: Enzymes are molecular machines optimized by nature to allow otherwise impossible chemical processes to occur. Their design is a challenging task due to the complexity of the protein space and the intricate relationships between sequence, structure, and function. Recently, large language models (LLMs) have emerged as powerful tools for modeling and analyzing biological sequences, but their application to protein design is limited by the high cardinality of the protein space. This study introduces a framework that combines LLMs with genetic algorithms (GAs) to optimize enzymes. LLMs are trained on a large dataset of protein sequences to learn relationships between amino acid residues linked to structure and function. This knowledge is then leveraged by GAs to efficiently search for sequences with improved catalytic performance. We focused on two optimization tasks: improving the feasibility of biochemical reactions and increasing their turnover rate. Systematic evaluations on 105 biocatalytic reactions demonstrated that the LLM-GA framework generated mutants outperforming the wild-type enzymes in terms of feasibility in 90% of the instances. Further in-depth evaluation of seven reactions reveals the power of this methodology to make 'the best of both worlds' and create mutants with structural features and flexibility comparable to the wild types. Our approach advances the state-of-the-art computational design of biocatalysts, ultimately opening opportunities for more sustainable chemical processes.
Chemistry
What problem does this paper attempt to address?
This paper presents a framework that combines large language models (LLMs) with genetic algorithms (GAs) to improve enzyme design. Enzymes are molecular machines that nature has optimized to catalyze chemical reactions, but designing new ones is difficult because protein space is vast and sequence, structure, and function are tightly coupled. LLMs have recently proven powerful for modeling and analyzing biological sequences, yet their application to protein design is limited by the high cardinality of protein space. The researchers train LLMs on a large dataset of protein sequences to learn relationships between amino acid residues that are linked to structure and function. Genetic algorithms then use this knowledge to search efficiently for sequences with improved catalytic performance. The study focuses on two optimization tasks: improving the feasibility of a biochemical reaction, that is, the likelihood that the enzyme catalyzes it, and increasing its turnover rate (kcat). In a systematic evaluation on 105 biocatalytic reactions, the LLM-GA framework generated mutants that outperformed the wild-type enzymes in terms of feasibility in 90% of cases. An in-depth evaluation of seven reactions shows that the approach combines the strengths of LLMs and GAs, producing mutants with structural features and flexibility comparable to those of the wild types. The work advances the computational design of biocatalysts and opens up possibilities for more sustainable chemical processes. The paper addresses the complexity of protein space optimization by using LLMs to guide amino acid substitution strategies and GAs to drive the optimization. A comparison of different mutation strategies shows that the LLM-guided strategy generates beneficial mutations more efficiently, especially when larger sequence variations are allowed.
In addition, the structural stability of the optimized enzymes is validated through molecular dynamics simulations, showing that the mutants improve catalytic efficiency while maintaining dynamics similar to those of the wild type.
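
The search loop summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it is hypothetical. Two toy functions stand in for the learned components: `toy_substitution_weights` takes the place of an LLM that proposes amino acid substitutions, and `toy_fitness` takes the place of a model that scores feasibility or turnover rate. The population size, elitism, and single-point crossover are likewise assumptions chosen for brevity.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_substitution_weights(sequence, position):
    """ASSUMPTION: stand-in for a protein language model.

    Returns one weight per amino acid for the given position. Here the
    weights are uniform over all residues except the current one; a real
    model would return context-dependent probabilities.
    """
    current = sequence[position]
    return {aa: (0.0 if aa == current else 1.0) for aa in AMINO_ACIDS}


def toy_fitness(sequence, target):
    """ASSUMPTION: stand-in for a feasibility or kcat predictor.

    Scores a sequence as the fraction of positions that match a
    hypothetical target sequence.
    """
    return sum(a == b for a, b in zip(sequence, target)) / len(target)


def mutate(sequence, rng, n_mutations=1):
    """Substitute n_mutations positions, sampling residues from the weights."""
    seq = list(sequence)
    for pos in rng.sample(range(len(seq)), n_mutations):
        weights = toy_substitution_weights("".join(seq), pos)
        residues = list(weights)
        seq[pos] = rng.choices(
            residues, weights=[weights[r] for r in residues], k=1
        )[0]
    return "".join(seq)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two equal-length sequences."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def optimize(wild_type, fitness, generations=30, population_size=20,
             n_elite=4, seed=0):
    """Elitist genetic algorithm that starts from the wild-type sequence."""
    rng = random.Random(seed)
    population = [wild_type] + [
        mutate(wild_type, rng) for _ in range(population_size - 1)
    ]
    for _ in range(generations):
        # Keep the best sequences unchanged so the top score never decreases.
        elite = sorted(population, key=fitness, reverse=True)[:n_elite]
        children = []
        while len(children) < population_size - n_elite:
            parent_a, parent_b = rng.sample(elite, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = elite + children
    return max(population, key=fitness)
```

As a usage example, calling `optimize("MKTAYIAKQR", lambda s: toy_fitness(s, "MKTWYIAKQL"))` returns a ten-residue sequence that scores at least as well as the wild type under the toy fitness, because the wild type enters the initial population and the elite are carried over each generation. Swapping the two toy functions for a trained language model and a trained scoring model is where the approach described in the paper would depart from this sketch.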