GP-MoLFormer: A Foundation Model For Molecular Generation

Jerret Ross, Brian Belgodere, Samuel C. Hoffman, Vijil Chenthamarakshan, Youssef Mroueh, Payel Das

2024-04-05

Abstract: Transformer-based models trained on large and general purpose datasets consisting of molecular strings have recently emerged as a powerful tool for successfully modeling various structure-property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1B chemical SMILES. GP-MoLFormer uses a 46.8M parameter transformer decoder model with linear attention and rotary positional encodings as the base architecture. We explore the utility of GP-MoLFormer in generating novel, valid, and unique SMILES. Impressively, we find GP-MoLFormer is able to generate a significant fraction of novel, valid, and unique SMILES even when the number of generated molecules is in the 10 billion range and the reference set is over a billion. We also find strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are impacted by the quality of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We evaluate GP-MoLFormer's utility and compare it with that of existing baselines on three different tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer performs better than or comparably to baselines across all three tasks, demonstrating its general utility.
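To make the "novel, valid, and unique" criteria concrete, here is a minimal sketch of how such fractions are commonly computed for a batch of generated strings. The definitions below (validity over all samples, uniqueness over valid samples, novelty over unique valid samples) are an assumption based on common benchmark practice and may differ from the exact definitions used in the paper. The `toy_canonicalize` helper is a hypothetical placeholder that only checks parenthesis balance; a real pipeline would use a cheminformatics toolkit to parse and canonicalize SMILES.

```python
from typing import Callable, Dict, Iterable, Optional, Set


def generation_metrics(
    generated: Iterable[str],
    training_set: Set[str],
    canonicalize: Callable[[str], Optional[str]],
) -> Dict[str, float]:
    """Return validity, uniqueness, and novelty fractions.

    `canonicalize` returns a canonical string, or None if the input is invalid.
    """
    total = 0
    valid = []
    for smiles in generated:
        total += 1
        canonical = canonicalize(smiles)
        if canonical is not None:
            valid.append(canonical)
    unique = set(valid)            # distinct valid molecules
    novel = unique - training_set  # distinct valid molecules unseen in training
    return {
        "validity": len(valid) / total if total else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_canonicalize(smiles: str) -> Optional[str]:
    """Placeholder validity check (assumption): non-empty, balanced parentheses.

    This is NOT a real SMILES parser and performs no canonicalization.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return None
    return smiles if smiles and depth == 0 else None


# Tiny illustrative inputs, not data from the paper.
train = {"CCO", "c1ccccc1"}
samples = ["CCO", "CCN", "CCN", "CC(C", "CC(=O)O"]
metrics = generation_metrics(samples, train, toy_canonicalize)
print(metrics)
```

With these toy inputs, four of the five strings pass the placeholder check, three of those four are distinct, and two of the three distinct strings are absent from the toy training set. At the scale described in the abstract (billions of generated and reference molecules), in-memory sets like these would not be practical; the sketch only illustrates the definitions.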

Biomolecules, Machine Learning, Chemical Physics

What problem does this paper attempt to address?

The main goal of this paper is to propose GP-MoLFormer, a large-scale pre-trained Transformer model for molecular generation. Specifically, the paper addresses the following points:

1. **Molecular generation on large-scale chemical datasets**: Trained on over 1.1 billion standardized SMILES strings, GP-MoLFormer can generate a large number of novel, valid, and unique molecular structures.
2. **Relationship between training data memorization and generation novelty**: The study investigates how strongly the model memorizes its training data and how this memorization affects the novelty of generated molecules. It finds that duplication bias in the training data increases memorization and lowers novelty, so deduplicating the training data reduces memorization and improves novelty.
3. **Molecular design for specific tasks**: Beyond unconstrained de novo generation, GP-MoLFormer is evaluated on two further tasks:
   - **Scaffold-constrained molecular decoration**: Generating new molecular fragments based on a given scaffold structure, with no additional training.
   - **Unconstrained property-guided optimization**: Generating molecules with improved properties through pair-tuning, a parameter-efficient fine-tuning method that takes property-ordered molecular pairs as input.
4. **Performance evaluation and comparison**: The paper compares GP-MoLFormer with other baseline models (such as CharRNN, VAE, etc.) on molecular generation and shows that it is competitive across these tasks.

In summary, GP-MoLFormer is a molecular generation model that can generate novel and valid molecules in a vast chemical space and can also handle specific application scenarios: scaffold decoration without further training, and property optimization through pair-tuning.
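The abstract states only that pair-tuning "uses property-ordered molecular pairs as input". As a rough illustration of what such input could look like, the sketch below builds (lower, higher) pairs from molecules with known property values. The function name `property_ordered_pairs`, the `min_gap` threshold, and the exhaustive pairing strategy are all assumptions made for illustration; the source does not say how pairs are selected, and the fine-tuning step itself is not shown.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def property_ordered_pairs(
    scores: Dict[str, float], min_gap: float = 0.0
) -> List[Tuple[str, str]]:
    """Return (lower, higher) SMILES pairs whose property gap exceeds min_gap.

    Hypothetical helper: the pairing rule is an assumption, not the paper's.
    """
    pairs = []
    for (smi_a, val_a), (smi_b, val_b) in combinations(scores.items(), 2):
        if abs(val_a - val_b) <= min_gap:
            continue  # skip ties and near-ties, which carry no ordering signal
        if val_a < val_b:
            pairs.append((smi_a, smi_b))
        else:
            pairs.append((smi_b, smi_a))
    return pairs


# Made-up property values, for illustration only.
scores = {"CCO": 0.5, "CCN": 0.2, "c1ccccc1": 0.9}
pairs = property_ordered_pairs(scores, min_gap=0.1)
for lower, higher in pairs:
    print(f"{lower} -> {higher}")
```

Each resulting pair places the molecule with the lower property value first, so a model conditioned on the first element could be trained to produce the second. Enumerating all pairs grows quadratically with the number of molecules, so a practical setup would likely sample pairs instead.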

PrefixMol: Target- and Chemistry-aware Molecule Design Via Prefix Embedding

Probabilistic generative transformer language models for generative design of molecules

Domain-Agnostic Molecular Generation with Chemical Feedback

PromptSMILES: Prompting for scaffold decoration and fragment linking in chemical language models

MolGPT: Molecular Generation Using a Transformer-Decoder Model

Chemical Language Model Linker: blending text and molecules with modular adapters

LLamol: A Dynamic Multi-Conditional Generative Transformer for De Novo Molecular Design

Data-Efficient Graph Grammar Learning for Molecular Generation

LigGPT: Molecular Generation using a Transformer-Decoder Model

Pre-trained Molecular Language Models with Random Functional Group Masking

Improving the reliability of molecular string representations for generative chemistry

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

Enhancing molecular design efficiency: Uniting language models and generative networks with genetic algorithms

Unlocking comprehensive molecular design across all scenarios with large language model and unordered chemical language

Sc2Mol: a Scaffold-Based Two-Step Molecule Generator with Variational Autoencoder and Transformer

GEN: Highly Efficient SMILES Explorer Using Autodidactic Generative Examination Networks

Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model

A Large Encoder-Decoder Family of Foundation Models For Chemical Language

SMILES-based deep generative scaffold decorator for de-novo drug design