Abstract: Enzymes are molecular machines optimized by nature to allow otherwise impossible chemical processes to occur. Their design is a challenging task due to the complexity of the protein space and the intricate relationships between sequence, structure, and function. Recently, large language models (LLMs) have emerged as powerful tools for modeling and analyzing biological sequences, but their application to protein design is limited by the high cardinality of the protein space. This study introduces a framework that combines LLMs with genetic algorithms (GAs) to optimize enzymes. LLMs are trained on a large dataset of protein sequences to learn relationships between amino acid residues linked to structure and function. This knowledge is then leveraged by GAs to efficiently search for sequences with improved catalytic performance. We focused on two optimization tasks: improving the feasibility of biochemical reactions and increasing their turnover rate. Systematic evaluations on 105 biocatalytic reactions demonstrated that the LLM-GA framework generated mutants outperforming the wild-type enzymes in terms of feasibility in 90% of the instances. Further in-depth evaluation of seven reactions reveals the power of this methodology to make 'the best of both worlds' and create mutants with structural features and flexibility comparable to the wild types. Our approach advances the state-of-the-art computational design of biocatalysts, ultimately opening opportunities for more sustainable chemical processes.

What problem does this paper attempt to address?

This paper presents a framework that combines genetic algorithms (GAs) with large language models (LLMs) for enzyme design. Enzymes are molecular machines optimized by nature to catalyze chemical reactions, but designing new ones is difficult because protein space is vast and sequence, structure, and function are intricately related. LLMs have recently shown strong capabilities in modeling and analyzing biological sequences, yet their application to protein design is limited by the high cardinality of protein space.

The researchers train LLMs on a large dataset of protein sequences to learn the relationships between amino acid residues that are linked to structure and function. GAs then use this knowledge to search efficiently for sequences with improved catalytic performance. The study focuses on two optimization tasks: improving the feasibility of biochemical reactions (the likelihood that an enzyme catalyzes a specific reaction) and increasing their turnover rate (kcat).

In a systematic evaluation of 105 biocatalytic reactions, the mutants generated by the LLM-GA framework outperformed the wild-type enzymes in terms of feasibility in 90% of the instances. An in-depth evaluation of seven reactions shows that the approach combines the strengths of its two components, LLMs and GAs, creating mutants with structural features and flexibility comparable to those of the wild types. This advances the computational design of biocatalysts and opens up possibilities for more sustainable chemical processes.

The paper addresses the complexity of optimizing over protein space by using LLMs to guide amino acid substitutions and GAs to drive the iterative search. A comparison of mutation strategies shows that the LLM-guided strategy generates beneficial mutations efficiently, especially when larger sequence variations are allowed. In addition, molecular dynamics simulations validate the structural stability of the optimized enzymes, showing that the mutants improve catalytic efficiency while maintaining dynamics similar to those of the wild type.
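The search loop described in the summary above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `propose_substitution` is a hypothetical stand-in for a protein language model's residue prediction (here it samples uniformly), `toy_fitness` is a hypothetical stand-in for the feasibility or turnover-rate predictor (here it counts matches to an arbitrary target string), and the population size, generation count, example sequences, and elitist selection scheme are all assumed values chosen for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def propose_substitution(sequence, position, rng):
    """Hypothetical stand-in for a language-model suggestion.

    A real protein language model would rank residues by their predicted
    probability at `position`; this placeholder samples uniformly from the
    residues that differ from the current one.
    """
    choices = [aa for aa in AMINO_ACIDS if aa != sequence[position]]
    return rng.choice(choices)


def toy_fitness(sequence, target):
    """Hypothetical stand-in for a learned scorer (feasibility or kcat)."""
    return sum(a == b for a, b in zip(sequence, target)) / len(target)


def mutate(sequence, n_mutations, rng):
    """Substitute `n_mutations` distinct positions using the proposal function."""
    residues = list(sequence)
    for pos in rng.sample(range(len(sequence)), n_mutations):
        residues[pos] = propose_substitution(sequence, pos, rng)
    return "".join(residues)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two equal-length sequences."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def optimize(wild_type, fitness, generations=30, population_size=20,
             n_mutations=1, seed=0):
    """Elitist genetic algorithm: keep the top half, refill with mutated offspring."""
    rng = random.Random(seed)
    population = [wild_type] + [
        mutate(wild_type, n_mutations, rng) for _ in range(population_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), n_mutations, rng))
        population = parents + children
    return max(population, key=fitness)


# Example run with made-up sequences (assumptions, not data from the paper).
wild_type = "MKTAYIAKQR"
target = "MKTWYIAKQL"
best = optimize(wild_type, lambda s: toy_fitness(s, target))
print(best, toy_fitness(best, target))
```

Because the top half of each generation is carried over unchanged and the wild type is part of the initial population, the best sequence returned never scores below the wild type under the chosen scorer. Raising `n_mutations` corresponds to allowing larger sequence variations per step.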

Integrating Genetic Algorithms and Language Models for Enhanced Enzyme Design

Conditional language models enable the efficient design of proficient enzymes

Adaptive language model training for molecular design

A language model assistant for biocatalysis

Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering

Enhanced Sequence-Activity Mapping and Evolution of Artificial Metalloenzymes by Active Learning

Enhancing molecular design efficiency: Uniting language models and generative networks with genetic algorithms

De novo design of triosephosphate isomerases using generative language models

Protein Language Models in Directed Evolution

On synergy between ultrahigh throughput screening and machine learning in biocatalyst engineering

Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates

Physics-based Modeling in the New Era of Enzyme Engineering

Enhancing luciferase activity and stability through generative modeling of natural enzyme sequences

Generative Design of Functional Metal Complexes Utilizing the Internal Knowledge of Large Language Models

Accelerating Biocatalysis Discovery with Machine Learning: A Paradigm Shift in Enzyme Engineering, Discovery, and Design

Harnessing generative AI to decode enzyme catalysis and evolution for enhanced engineering

COMPUTATIONAL ENZYME DESIGN APPROACHES WITH SIGNIFICANT BIOLOGICAL OUTCOMES: PROGRESS AND CHALLENGES

Navigating the landscape of enzyme design: from molecular simulations to machine learning

MetaEnzyme: Meta Pan-Enzyme Learning for Task-Adaptive Redesign

Engineering of highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening