BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning

Artem Zholus, Maksim Kuznetsov, Roman Schutski, Rim Shayakhmetov, Daniil Polykovskiy, Sarath Chandar, Alex Zhavoronkov
2024-06-06
Abstract: Generating novel active molecules for a given protein is an extremely challenging task for generative models that requires an understanding of the complex physical interactions between the molecule and its environment. In this paper, we present a novel generative model, BindGPT, which uses a conceptually simple but powerful approach to create 3D molecules within the protein's binding site. Our model produces molecular graphs and conformations jointly, eliminating the need for an extra graph reconstruction step. We pretrain BindGPT on a large-scale dataset and fine-tune it with reinforcement learning using scores from external simulation software. We demonstrate how a single pretrained language model can serve at the same time as a 3D molecular generative model, a conformer generator conditioned on the molecular graph, and a pocket-conditioned 3D molecule generator. Notably, the model does not make any representational equivariance assumptions about the domain of generation. We show how such a simple conceptual approach, combined with pretraining and scaling, can perform on par with or better than the current best specialized diffusion models, language models, and graph neural networks while being two orders of magnitude cheaper to sample.
Machine Learning
What problem does this paper attempt to address?
This paper introduces BindGPT, a framework for scalable 3D molecular design that combines language modeling with reinforcement learning. Generating novel active molecules for a given protein remains difficult for generative models because it requires an understanding of the complex physical interactions between a molecule and its environment. BindGPT creates 3D molecules directly within the protein's binding site, producing the molecular graph and the conformation jointly so that no separate graph reconstruction step is needed.

The model is first pretrained on a large-scale dataset and then fine-tuned with reinforcement learning, using scores from external simulation software. A single pretrained model can serve as a 3D molecular generative model, a conformer generator conditioned on the molecular graph, and a pocket-conditioned 3D molecule generator. Notably, the model makes no representational equivariance assumptions about the domain of generation. The paper notes that, although existing methods can generate 3D molecules directly, most rely on external tools to reconstruct bonds, which can introduce errors. BindGPT instead describes the molecular graph and the atomic positions with the SMILES and XYZ formats, reducing the dependence on external software.

Experimental results show that BindGPT performs on par with or better than current specialized diffusion models, language models, and graph neural networks on 3D molecular generation tasks while being two orders of magnitude cheaper to sample. With reinforcement learning fine-tuning, the model can also find structures with high binding scores for a given protein. In short, the paper addresses how to generate active 3D molecules more effectively, accounting for their interactions with proteins, while reducing reliance on external tools and improving the accuracy and efficiency of generation.
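
To make the idea of describing a molecule with SMILES plus XYZ text more concrete, here is a minimal Python sketch of how a graph string and per-atom coordinates could be packed into one text sequence and read back. It is an illustration under stated assumptions, not the paper's actual tokenization: the `<LIGAND>` and `<XYZ>` markers, the function names, the three-decimal precision, and the convention that coordinate lines follow the atom order of the SMILES string are all hypothetical choices made for this example. The methanol coordinates are illustrative rather than an optimized geometry.

```python
from typing import List, Tuple

Coord = Tuple[float, float, float]

# Hypothetical section markers, chosen for this sketch only.
LIGAND_TAG = "<LIGAND>"
XYZ_TAG = "<XYZ>"


def encode_molecule(smiles: str, coords: List[Coord], decimals: int = 3) -> str:
    """Pack a SMILES string and per-atom coordinates into one text sequence.

    Assumption: coords[i] belongs to the i-th heavy atom in the SMILES string,
    so element symbols do not need to be repeated in the coordinate block.
    """
    lines = [LIGAND_TAG, smiles, XYZ_TAG]
    for x, y, z in coords:
        lines.append(f"{x:.{decimals}f} {y:.{decimals}f} {z:.{decimals}f}")
    return "\n".join(lines)


def decode_molecule(text: str) -> Tuple[str, List[Coord]]:
    """Recover the SMILES string and coordinates from the text sequence."""
    lines = text.strip().split("\n")
    if len(lines) < 3 or lines[0] != LIGAND_TAG or lines[2] != XYZ_TAG:
        raise ValueError("text does not follow the expected tag layout")
    smiles = lines[1]
    coords: List[Coord] = []
    for line in lines[3:]:
        x, y, z = (float(value) for value in line.split())
        coords.append((x, y, z))
    return smiles, coords


# Methanol heavy atoms (C, O) with illustrative coordinates in angstroms.
sequence = encode_molecule("CO", [(0.0, 0.0, 0.0), (1.43, 0.0, 0.0)])
smiles, coords = decode_molecule(sequence)
```

Because both the graph and the geometry live in the same string, a language model trained on such sequences emits bonds and positions together, which is the property the summary credits with removing the separate bond reconstruction step.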