Efficient Evolutionary Search Over Chemical Space with Large Language Models

Haorui Wang, Marta Skreta, Cher-Tian Ser, Wenhao Gao, Lingkai Kong, Felix Strieth-Kalthoff, Chenru Duan, Yuchen Zhuang, Yue Yu, Yanqiao Zhu, Yuanqi Du, Alán Aspuru-Guzik, Kirill Neklyudov, Chao Zhang
2024-07-03
Abstract: Molecular discovery, when formulated as an optimization problem, presents significant computational challenges because optimization objectives can be non-differentiable. Evolutionary Algorithms (EAs), often used to optimize black-box objectives in molecular discovery, traverse chemical space by performing random mutations and crossovers, leading to a large number of expensive objective evaluations. In this work, we ameliorate this shortcoming by incorporating chemistry-aware Large Language Models (LLMs) into EAs. Namely, we redesign crossover and mutation operations in EAs using LLMs trained on large corpora of chemical information. We perform extensive empirical studies on both commercial and open-source models on multiple tasks involving property optimization, molecular rediscovery, and structure-based drug design, demonstrating that the joint usage of LLMs with EAs yields superior performance over all baseline models across single- and multi-objective settings. We demonstrate that our algorithm improves both the quality of the final solution and convergence speed, thereby reducing the number of required objective evaluations. Our code is available at http://github.com/zoom-wang112358/MOLLEO
Neural and Evolutionary Computing, Artificial Intelligence, Machine Learning, Chemical Physics
What problem does this paper attempt to address?
This paper addresses the cost of using evolutionary algorithms (EAs) for optimization in molecular discovery. Molecular discovery can be formulated as an optimization problem, but the objective may be non-differentiable, which poses significant computational challenges. EAs are often used to optimize such black-box objectives: they traverse chemical space through random mutation and crossover operations, and this randomness leads to a large number of expensive objective function evaluations. The paper therefore proposes a new method, Molecular Language-Enhanced Evolutionary Optimization (MOLLEO), which integrates chemistry-aware large language models (LLMs) into EAs. The crossover and mutation operations are redesigned around LLMs to reduce the number of required objective function evaluations and to improve both the quality of the final solution and the convergence speed.

### Main Contributions

1. **Reducing the Number of Objective Function Evaluations**: Using LLMs to propose more reasonable molecular candidates reduces the number of expensive objective function evaluations needed during optimization.
2. **Improving the Quality of Solutions**: MOLLEO improves the quality of the final solution and also accelerates convergence.
3. **Multi-task Verification**: Extensive empirical studies cover multiple tasks, in both single-objective and multi-objective settings, and show superior performance of MOLLEO across these scenarios.
4. **Practical Applications**: In practical scenarios such as drug design, MOLLEO outperforms baseline models, especially on complex tasks such as protein-ligand docking.
### Method Overview

The core idea of MOLLEO is to use large language models as genetic operators that generate new molecular proposals. The specific steps are as follows:

- **Initial Pool Selection**: Randomly select 120 molecules from the ZINC 250K database as the initial pool.
- **Crossover Operation**: Use an LLM to generate new molecules according to the target description, instead of randomly combining two parent molecules.
- **Mutation Operation**: Mutate the best molecule in the current population, guided by the target description.
- **Evaluation and Selection**: Evaluate the newly generated molecules with the objective function and select the best molecules to enter the next generation.

### Experimental Results

The experimental results show that MOLLEO outperforms baseline models on multiple tasks, especially in single-objective optimization. MOLLEO (GPT-4) performs best in 9 out of 12 tasks. MOLLEO (BioT5) and MOLLEO (MolSTM) also perform well, with total scores approaching that of MOLLEO (GPT-4).

### Conclusion

By integrating large language models into evolutionary algorithms, this paper addresses the optimization problem in molecular discovery, improving both optimization efficiency and the quality of solutions. The method has promising applications in fields such as drug design.
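
The loop described in the Method Overview (initial pool, LLM-driven crossover, mutation of the best molecule, then evaluation and selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `CountingOracle`, `evolve`, and the toy operators are hypothetical names, molecules are plain strings rather than validated SMILES, and the "LLM" is replaced by random string operations so that the example runs without any model or chemistry library. The offspring count, evaluation budget, and generation cap are assumed values; only the pool size of 120 comes from the summary above.

```python
import random
from typing import Callable, Dict, List, Sequence

Molecule = str  # stand-in for a SMILES string; no chemical validity is checked
Operator = Callable[[Sequence[Molecule], str], Molecule]  # hypothetical LLM-backed proposer


class CountingOracle:
    """Caches objective values and counts distinct (expensive) evaluations."""

    def __init__(self, objective: Callable[[Molecule], float]) -> None:
        self.objective = objective
        self.cache: Dict[Molecule, float] = {}

    def __call__(self, mol: Molecule) -> float:
        if mol not in self.cache:
            self.cache[mol] = self.objective(mol)
        return self.cache[mol]

    @property
    def calls(self) -> int:
        return len(self.cache)


def evolve(pool: Sequence[Molecule], oracle: CountingOracle,
           crossover: Operator, mutate: Operator, task: str,
           pool_size: int = 120, n_offspring: int = 20,
           budget: int = 500, max_generations: int = 50,
           seed: int = 0) -> List[Molecule]:
    """Return the final population, best first."""
    rng = random.Random(seed)

    def rank(mols: Sequence[Molecule]) -> List[Molecule]:
        # Deduplicate, then sort by score (ties broken by string for determinism).
        return sorted(set(mols), key=lambda m: (oracle(m), m), reverse=True)

    population = rank(rng.sample(list(pool), min(pool_size, len(pool))))
    if len(population) < 2:
        raise ValueError("need at least two distinct molecules to do crossover")

    for _ in range(max_generations):
        # The budget is checked once per generation, so the last one may overshoot.
        if oracle.calls >= budget:
            break
        # Crossover: the operator sees two parents plus the task description.
        offspring = [crossover(rng.sample(population, 2), task)
                     for _ in range(n_offspring)]
        # Mutation: applied to the current best molecule.
        offspring.append(mutate([population[0]], task))
        # Selection: keep the top pool_size of parents and offspring (elitist).
        population = rank(population + offspring)[:pool_size]
    return population


# --- Toy stand-ins (assumptions, not chemistry): random string edits replace the LLM.
_toy_rng = random.Random(1)


def toy_objective(mol: Molecule) -> float:
    return mol.count("N") / len(mol)


def toy_crossover(parents: Sequence[Molecule], task: str) -> Molecule:
    a, b = parents
    cut = _toy_rng.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]


def toy_mutate(parents: Sequence[Molecule], task: str) -> Molecule:
    (mol,) = parents
    i = _toy_rng.randrange(len(mol))
    return mol[:i] + _toy_rng.choice("CNO") + mol[i + 1:]


gen = random.Random(2)
initial_pool = ["".join(gen.choice("CNO") for _ in range(8)) for _ in range(30)]
oracle = CountingOracle(toy_objective)
final = evolve(initial_pool, oracle, toy_crossover, toy_mutate,
               task="maximize the fraction of N characters",
               pool_size=30, n_offspring=10, budget=200)
```

In a real setting, `toy_crossover` and `toy_mutate` would be replaced by functions that prompt a chemistry-aware model with the parent molecules and the task description and parse a molecule from the response. Counting distinct oracle calls, as `CountingOracle` does here, is one simple way to compare how many objective evaluations different operators need.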