Small Molecule Optimization with Large Language Models

Philipp Guevorguian, Menua Bedrosian, Tigran Fahradyan, Gayane Chilingaryan, Hrant Khachatrian, Armen Aghajanyan
2024-07-27
Abstract: Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.
Machine Learning, Neural and Evolutionary Computing, Quantitative Methods
What problem does this paper attempt to address?
The paper addresses how to use large language models effectively for small-molecule design and optimization, with the aim of accelerating the molecular optimization stage of drug discovery. The authors present two chemistry-specific language models, Chemlactica and Chemma, fine-tuned on a novel corpus of 110 million molecules with computed properties (40 billion tokens in total). According to the authors, the models perform strongly at generating molecules with specified properties and at predicting new molecular properties from limited samples. The paper also introduces an optimization algorithm that uses these models to optimize molecules for arbitrary properties given only limited access to a black-box oracle. The algorithm combines ideas from genetic algorithms, rejection sampling, and prompt optimization, and achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement over previous methods on the Practical Molecular Optimization benchmark. In addition, the authors show that the models can be fine-tuned efficiently with a small number of training examples to predict various molecular properties, with competitive results on standard benchmarks such as ESOL and FreeSolv, which suggests that the models can adapt quickly to new tasks in the drug discovery pipeline.
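To make the idea of optimizing under a limited oracle budget more concrete, the following is a minimal toy sketch of a pool-based loop in that spirit. It is not the authors' algorithm: `toy_oracle`, `toy_generator`, `optimize`, and all parameters are hypothetical names chosen for illustration, the "molecules" are plain strings over a three-letter alphabet, and a random single-character mutation stands in for sampling from a language model prompted with the current best candidates. The sketch only illustrates three ingredients named in the summary: keeping a pool of top candidates (as in genetic algorithms), rejecting duplicates before spending an oracle call, and conditioning generation on the pool.

```python
import random


def toy_oracle(candidate: str) -> float:
    """Hypothetical black-box score: the fraction of 'C' characters."""
    return candidate.count("C") / len(candidate)


def toy_generator(pool: list[str], rng: random.Random) -> str:
    """Stand-in for a language model prompted with pool members.

    Assumption: a real system would sample from the model; here we simply
    mutate one character of a randomly chosen pool member.
    """
    parent = rng.choice(pool)
    pos = rng.randrange(len(parent))
    return parent[:pos] + rng.choice("CNO") + parent[pos + 1:]


def optimize(oracle, generator, seeds, budget=100, pool_size=5, seed=0):
    """Maximize `oracle` using at most `budget` oracle calls."""
    rng = random.Random(seed)
    scores = {}  # every evaluated candidate -> its oracle score
    for s in seeds:
        scores[s] = oracle(s)
    calls = len(scores)
    pool = sorted(scores, key=scores.get, reverse=True)[:pool_size]

    attempts = 0
    max_attempts = budget * 50  # guard against an exhausted search space
    while calls < budget and attempts < max_attempts:
        attempts += 1
        candidate = generator(pool, rng)
        if candidate in scores:
            # Rejection step: never spend an oracle call on a duplicate.
            continue
        scores[candidate] = oracle(candidate)
        calls += 1
        # Selection step: keep only the best-scoring candidates in the pool.
        pool = sorted(scores, key=scores.get, reverse=True)[:pool_size]
    return pool, scores, calls


if __name__ == "__main__":
    pool, scores, calls = optimize(toy_oracle, toy_generator, ["NNNNNN", "OOOOOO"])
    print(f"oracle calls used: {calls}, best candidate: {pool[0]}")
```

In this sketch the oracle budget is the only stopping criterion that matters in practice; the attempt cap merely prevents an endless loop if the generator keeps proposing candidates that were already scored.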