GP-MoLFormer: A Foundation Model For Molecular Generation

Jerret Ross, Brian Belgodere, Samuel C. Hoffman, Vijil Chenthamarakshan, Youssef Mroueh, Payel Das
2024-04-05
Abstract: Transformer-based models trained on large and general-purpose datasets consisting of molecular strings have recently emerged as a powerful tool for successfully modeling various structure-property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1B chemical SMILES. GP-MoLFormer uses a 46.8M parameter transformer decoder model with linear attention and rotary positional encodings as the base architecture. We explore the utility of GP-MoLFormer in generating novel, valid, and unique SMILES. Impressively, we find GP-MoLFormer is able to generate a significant fraction of novel, valid, and unique SMILES even when the number of generated molecules is in the 10 billion range and the reference set is over a billion. We also find strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are impacted by the quality of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We evaluate GP-MoLFormer's utility and compare it with that of existing baselines on three different tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer performs better than or comparably to baselines across all three tasks, demonstrating its general utility.
Biomolecules, Machine Learning, Chemical Physics
What problem does this paper attempt to address?
The paper proposes GP-MoLFormer, a large-scale pre-trained Transformer-based model for molecular generation. It addresses the following key issues:

1. **Molecular generation on large-scale chemical datasets**: Trained on more than 1.1 billion standardized SMILES strings, GP-MoLFormer generates a significant fraction of novel, valid, and unique molecules, even when the number of generated molecules is in the 10 billion range.
2. **Relationship between training data memorization and generation novelty**: The study investigates how strongly the model memorizes its training data and how this memorization affects the novelty of generated molecules. Duplication bias in the training data strengthens memorization and lowers novelty, so deduplicating the training data reduces memorization and increases the novelty of generated molecules.
3. **Molecular design for specific tasks**: Beyond unconstrained de novo generation, GP-MoLFormer is evaluated on two specific tasks:
   - **Scaffold-constrained molecular decoration**: Completing a given scaffold structure with new molecular fragments, with no additional training.
   - **Unconstrained property-guided optimization**: Generating molecules with improved target properties through pair-tuning, a parameter-efficient fine-tuning method that takes property-ordered molecular pairs as input.
4. **Performance evaluation and comparison**: The paper compares GP-MoLFormer with baseline models (such as CharRNN and VAE) and reports that it performs better than or comparably to them across all three tasks.

In summary, GP-MoLFormer generates novel and valid molecules across a vast chemical space, handles scaffold decoration without additional training, and adapts to property optimization through pair-tuning.
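The validity, uniqueness, and novelty fractions discussed above can be illustrated with a minimal sketch. It assumes one common convention (validity over all generated strings, uniqueness over valid ones, novelty over unique ones relative to the training set); the paper's exact definitions may differ. The names `generation_metrics` and `toy_is_valid` are hypothetical, and the toy validity check only tests balanced parentheses. A real pipeline would parse and canonicalize each SMILES with a cheminformatics toolkit before comparing strings.

```python
from typing import Callable, Dict, Iterable, Set


def toy_is_valid(smiles: str) -> bool:
    """Placeholder validity check (assumption): non-empty with balanced parentheses.

    This is NOT real SMILES validation; it only stands in for a proper parser.
    """
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


def generation_metrics(
    generated: Iterable[str],
    training_set: Set[str],
    is_valid: Callable[[str], bool] = toy_is_valid,
) -> Dict[str, float]:
    """Compute validity, uniqueness, and novelty fractions.

    Assumes every string is already in one canonical form, so that
    string equality means molecule equality.
    """
    generated = list(generated)
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)              # distinct valid molecules
    novel = unique - training_set    # distinct valid molecules unseen in training
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy data, not taken from the paper.
samples = ["CCO", "CCO", "c1ccccc1", "C(C", "CCN"]
train = {"CCO"}
metrics = generation_metrics(samples, train)
print(metrics)
```

Under this convention, the share of unique valid molecules that also appear in the training set is `1 - novelty`, which gives a simple proxy for exact-match memorization.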