Small Molecule Optimization with Large Language Models

Philipp Guevorguian, Menua Bedrosian, Tigran Fahradyan, Gayane Chilingaryan, Hrant Khachatrian, Armen Aghajanyan

2024-07-27

Abstract: Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the training corpus, the language models and the optimization algorithm.

Machine Learning, Neural and Evolutionary Computing, Quantitative Methods

What problem does this paper attempt to address?

This paper proposes a method for small molecule optimization built on large language models. The authors develop two chemistry-specific language models, Chemlactica and Chemma, fine-tuned on a novel corpus of 110 million molecules with computed properties, about 40 billion tokens in total. The models perform strongly at generating molecules with specified properties and at predicting new molecular properties from limited samples.

The paper also introduces an optimization algorithm that uses these models to optimize molecules for arbitrary properties given only limited access to a black-box oracle. The method combines ideas from genetic algorithms, rejection sampling, and prompt optimization, and it achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement over previous methods on the Practical Molecular Optimization benchmark.

In addition, the authors show that the models can be fine-tuned efficiently with a small number of training examples to predict various molecular properties, with competitive results on standard benchmarks such as ESOL and FreeSolv. This suggests the models can adapt quickly to new tasks in the drug discovery pipeline. In summary, the paper addresses how to use large language models effectively for molecular design and optimization, in particular to accelerate the molecular optimization stage of drug discovery.
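To make the combination of genetic search, rejection sampling, and prompt construction under a limited oracle budget more concrete, here is a minimal Python sketch. It is not the authors' algorithm: `toy_oracle`, `toy_generator`, `optimize`, and every parameter are hypothetical, molecules are replaced by toy strings over the alphabet `CNO`, and a random mutation stands in for sampling from a language model.

```python
import random

ALPHABET = "CNO"  # toy stand-in for a molecular vocabulary (assumption)


def toy_oracle(candidate: str) -> float:
    """Hypothetical black-box score: fraction of 'N' characters."""
    return candidate.count("N") / len(candidate)


def toy_generator(prompt: list[str], rng: random.Random) -> str:
    """Stand-in for a language model: mutate one character of a prompt example."""
    parent = rng.choice(prompt)
    i = rng.randrange(len(parent))
    return parent[:i] + rng.choice(ALPHABET) + parent[i + 1:]


def optimize(oracle, generator, seeds, budget=100, pool_size=10,
             prompt_size=3, seed=0):
    """Budgeted search: never spends more than `budget` oracle calls."""
    rng = random.Random(seed)
    scored = {}  # candidate -> score; doubles as the set of seen candidates
    calls = 0
    for s in seeds:
        if s not in scored and calls < budget:
            scored[s] = oracle(s)
            calls += 1
    attempts = 0
    max_attempts = budget * 50  # guard so the loop always terminates
    while calls < budget and attempts < max_attempts:
        attempts += 1
        # Genetic-style selection: keep only the best-scoring candidates.
        pool = sorted(scored, key=scored.get, reverse=True)[:pool_size]
        # Prompt construction: condition the generator on a few pool members.
        prompt = rng.sample(pool, min(prompt_size, len(pool)))
        candidate = generator(prompt, rng)
        # Rejection step: drop duplicates before paying for an oracle call.
        if candidate in scored:
            continue
        scored[candidate] = oracle(candidate)
        calls += 1
    best = max(scored, key=scored.get)
    return best, scored[best], calls


best, score, calls = optimize(toy_oracle, toy_generator, ["CCCCCC"])
print(best, score, calls)
```

In a realistic setting the generator would be a language model prompted with high-scoring molecules, and the rejection step would also filter out invalid structures; both are left out here to keep the sketch self-contained.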

Large Language Models as Molecular Design Engines

DrugAssist: A Large Language Model for Molecule Optimization

Adaptive language model training for molecular design

Chemical Language Model Linker: blending text and molecules with modular adapters

De novo drug design as GPT language modeling: large chemistry models with supervised and reinforcement learning

Efficient Evolutionary Search Over Chemical Space with Large Language Models

Discovering Photoswitchable Molecules for Drug Delivery with Large Language Models and Chemist Instruction Training

Domain-Agnostic Molecular Generation with Chemical Feedback

Language models in molecular discovery

Unlocking comprehensive molecular design across all scenarios with large language model and unordered chemical language

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

LICO: Large Language Models for In-Context Molecular Optimization

Large Language Models Open New Way of AI-Assisted Molecule Design for Chemists

Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning

Preference Optimization for Molecular Language Models

Probabilistic generative transformer language models for generative design of molecules

Utilizing Large Language Models in an iterative paradigm with Domain feedback for Zero-shot Molecule optimization

Large language models design sequence-defined macromolecules via evolutionary optimization

Keeping it Simple: Language Models can learn Complex Molecular Distributions