Abstract: Molecular discovery, when formulated as an optimization problem, presents significant computational challenges because optimization objectives can be non-differentiable. Evolutionary Algorithms (EAs), often used to optimize black-box objectives in molecular discovery, traverse chemical space by performing random mutations and crossovers, leading to a large number of expensive objective evaluations. In this work, we ameliorate this shortcoming by incorporating chemistry-aware Large Language Models (LLMs) into EAs. Namely, we redesign crossover and mutation operations in EAs using LLMs trained on large corpora of chemical information. We perform extensive empirical studies on both commercial and open-source models on multiple tasks involving property optimization, molecular rediscovery, and structure-based drug design, demonstrating that the joint usage of LLMs with EAs yields superior performance over all baseline models across single- and multi-objective settings. We demonstrate that our algorithm improves both the quality of the final solution and convergence speed, thereby reducing the number of required objective evaluations. Our code is available at http://github.com/zoom-wang112358/MOLLEO

What problem does this paper attempt to address?

This paper addresses the cost of using evolutionary algorithms (EAs) for optimization in molecular discovery. Molecular discovery can be formulated as an optimization problem, but the objective may be non-differentiable, which poses significant computational challenges. EAs are often used to optimize such black-box objectives: they traverse chemical space through random mutation and crossover operations, which requires a large number of expensive objective function evaluations. The paper therefore proposes Molecular Language-Enhanced Evolutionary Optimization (MOLLEO), which integrates chemistry-aware large language models (LLMs) into EAs by redesigning the crossover and mutation operations. The goal is to reduce the number of required objective function evaluations while improving the quality of the final solution and the convergence speed.

### Main Contributions

1. **Reducing the Number of Objective Function Evaluations**: By using large language models to generate more reasonable molecular candidates, the number of expensive objective function evaluations required in the optimization process is reduced.
2. **Improving the Quality of Solutions**: MOLLEO not only improves the quality of the final solution but also accelerates convergence.
3. **Multi-task Verification**: Extensive empirical studies have been carried out on multiple tasks, including single-objective and multi-objective optimization tasks, demonstrating the superior performance of MOLLEO in different scenarios.
4. **Practical Applications**: In practical application scenarios such as drug design, MOLLEO shows better performance than baseline models, especially in complex tasks such as protein-ligand docking.
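Because the first contribution is stated in terms of the number of objective function evaluations, it may help to see how such a count could be tracked. The following is a minimal sketch, not the authors' code: the class name `BudgetedOracle`, the `budget` parameter, and the convention of counting only unique molecules are all assumptions made for illustration.

```python
from typing import Callable, Dict


class BudgetedOracle:
    """Hypothetical wrapper: caches scores and counts unique objective evaluations."""

    def __init__(self, objective: Callable[[str], float], budget: int) -> None:
        self.objective = objective          # the expensive black-box objective
        self.budget = budget                # assumed cap on unique evaluations
        self.cache: Dict[str, float] = {}   # molecule string -> score

    @property
    def calls(self) -> int:
        # Each cached molecule corresponds to exactly one real evaluation.
        return len(self.cache)

    def __call__(self, molecule: str) -> float:
        if molecule in self.cache:
            # Repeated proposals are served from the cache and cost nothing.
            return self.cache[molecule]
        if self.calls >= self.budget:
            raise RuntimeError("evaluation budget exhausted")
        score = self.objective(molecule)
        self.cache[molecule] = score
        return score
```

Under this assumed convention, two optimizers can be compared by the best score each reaches before `calls` hits the same budget.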
### Method Overview

The core idea of MOLLEO is to use large language models as genetic operators that generate new molecular proposals. The steps are as follows:

- **Initial Pool Selection**: Randomly select 120 molecules from the ZINC 250K database as the initial pool.
- **Crossover Operation**: Use large language models to generate new molecules according to the target description, instead of randomly combining two parent molecules.
- **Mutation Operation**: Mutate the best molecule in the current population based on the target description.
- **Evaluation and Selection**: Evaluate the newly generated molecules with the objective function and select the best molecules to enter the next generation.

### Experimental Results

The experimental results show that MOLLEO outperforms baseline models on multiple tasks, especially single-objective optimization tasks. MOLLEO (GPT-4) performs best in 9 out of 12 tasks. MOLLEO (BioT5) and MOLLEO (MolSTM) also perform well, each approaching MOLLEO (GPT-4) in total score.

### Conclusion

By integrating large language models into evolutionary algorithms, this paper addresses the optimization problem in molecular discovery and improves both optimization efficiency and solution quality. The method has promising applications in fields such as drug design.
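The four steps in the method overview (initial pool, LLM-guided crossover, mutation of the best molecule, evaluation and selection) can be sketched as a generic evolutionary loop. This is an illustrative sketch under stated assumptions, not the MOLLEO implementation: the `evolve` function, the `Operator` signature, the `num_offspring` and `generations` defaults, and the `toy_*` stand-ins are all hypothetical. In a real setup the two operators would call a chemistry-aware LLM with the task description, and the objective would be a property or docking score. The toy operators only manipulate strings and do not produce validated molecules.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical operator signature: (parent molecules, task description) -> child.
Operator = Callable[[List[str], str], str]


def evolve(
    initial_pool: List[str],
    objective: Callable[[str], float],
    crossover: Operator,
    mutate: Operator,
    task: str,
    pool_size: int = 120,    # pool size given in the summary
    num_offspring: int = 8,  # assumed; not stated in the summary
    generations: int = 5,    # assumed; not stated in the summary
    seed: int = 0,
) -> Tuple[List[str], Dict[str, float]]:
    rng = random.Random(seed)
    # Evaluate each distinct molecule in the initial pool once.
    scores: Dict[str, float] = {m: objective(m) for m in dict.fromkeys(initial_pool)}
    population = sorted(scores, key=lambda m: scores[m], reverse=True)[:pool_size]

    for _ in range(generations):
        # Crossover: the operator proposes a child from two sampled parents.
        children = [
            crossover(rng.sample(population, 2), task) for _ in range(num_offspring)
        ]
        # Mutation: applied to the best molecule in the current population.
        children.append(mutate([population[0]], task))
        # Evaluation: only unseen molecules cost an objective call.
        for child in children:
            if child not in scores:
                scores[child] = objective(child)
        # Selection: keep the top-scoring molecules among parents and children.
        candidates = list(dict.fromkeys(population + children))
        population = sorted(candidates, key=lambda m: scores[m], reverse=True)[:pool_size]

    return population, scores


# Toy stand-ins so the sketch runs without an LLM or a chemistry toolkit.
def toy_objective(molecule: str) -> float:
    return float(molecule.count("C"))


def toy_crossover(parents: List[str], task: str) -> str:
    a, b = parents
    return a[: len(a) // 2] + b[len(b) // 2 :]


def toy_mutate(parents: List[str], task: str) -> str:
    return parents[0] + "C"
```

Passing the operators in as callables keeps the loop unchanged when a toy operator is replaced by an LLM-backed one. The returned `scores` dictionary also records how many distinct molecules were evaluated.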

Efficient Evolutionary Search Over Chemical Space with Large Language Models
