Exhaustive local chemical space exploration using a transformer model

Alessandro Tibo, Jiazhen He, Jon Paul Janet, Eva Nittinger, Ola Engkvist

DOI: https://doi.org/10.26434/chemrxiv-2023-v25xb-v2

2024-05-22

Abstract: How many near-neighbors does a molecule have? This is a simple, fundamental, but unsolved question in chemistry. It is key for solving many important molecular optimization problems, for example in lead optimization in drug discovery under the similarity principle assumption. Generative models can sample virtual molecules from a vast theoretical chemical space, but so far have lacked explicit knowledge about molecular similarity. This means that a generative model needs to be guided by reinforcement learning or another learning mechanism to be able to sample a relevant similar chemical space. Correspondingly, the generative model provides no mechanism for quantifying how completely it can sample a particular region of the chemical space. To overcome these limitations, a novel source-target molecular transformer model is proposed, regularized via a similarity kernel function. It has been trained on, to the best of our knowledge, the largest data set of molecular pairs so far, consisting of ≥ 200 billion pairs. The regularization term enforces a direct relationship between the probability of generating a target molecule and its similarity to a given source molecule. The model is able to systematically sample compounds ordered by their probability and accordingly by their similarity. In combination with a deterministic sampling strategy, beam search, it is possible for the first time to comprehensively explore the near-neighborhood around a specific compound. Our results show that the regularization term helps to substantially improve the correlation between the probability of generating a target molecule and its similarity to the source molecule. The trained transformer model is able to exhaustively sample a near-neighborhood around a given drug-like molecule.
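The relationship between generation probability and similarity described in the abstract can be illustrated with a small sketch. The abstract does not give the form of the regularization term, so the pairwise ranking penalty below is an assumption chosen for illustration, not the authors' loss. The function names `tanimoto` and `ranking_penalty` are hypothetical, and fingerprints are represented as plain sets of integer bit indices rather than output from a cheminformatics library.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ranking_penalty(similarities, log_probs):
    """Hypothetical pairwise ranking penalty for one source molecule.

    similarities[i] is the similarity of target i to the source, and
    log_probs[i] is the model's log-probability of generating target i.
    A pair contributes to the penalty only when the more similar target
    is assigned the lower log-probability.
    """
    penalty = 0.0
    for i, j in combinations(range(len(similarities)), 2):
        sim_gap = similarities[i] - similarities[j]
        logp_gap = log_probs[i] - log_probs[j]
        if sim_gap > 0:
            # Target i is more similar, so it should not be less probable.
            penalty += max(0.0, -logp_gap)
        elif sim_gap < 0:
            # Target j is more similar, so it should not be less probable.
            penalty += max(0.0, logp_gap)
    return penalty


# Toy fingerprints (assumed data, not real molecules).
source = frozenset({1, 2, 3, 4})
targets = [frozenset({1, 2, 3, 4, 5}), frozenset({1, 2, 7, 8})]
sims = [tanimoto(source, t) for t in targets]

consistent = ranking_penalty(sims, [-1.0, -3.0])    # similar target is more probable
inconsistent = ranking_penalty(sims, [-3.0, -1.0])  # similar target is less probable
```

In this toy case the penalty is zero when the probability ordering agrees with the similarity ordering and positive when it does not. Added to a standard training loss, a term of this kind would push the model toward ranking targets by similarity.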

Chemistry

What problem does this paper attempt to address?

This paper addresses local search in chemical space: how to systematically explore the compounds that are near-neighbors of a specific molecule. Generative models can sample virtual molecules from the vast theoretical chemical space, but they lack explicit knowledge of molecular similarity, so methods such as reinforcement learning are needed to guide sampling toward a relevant, similar region. To address this issue, the researchers propose a novel source-target molecular transformer model that is regularized with a similarity kernel function. The model is trained on the largest molecular pair dataset to date (at least 200 billion pairs), and the regularization term enforces a direct relationship between the probability of generating a target molecule and its similarity to the given source molecule. Combined with a deterministic sampling strategy, beam search, the model can systematically and comprehensively explore the neighborhood of a specific compound. The main contributions of the paper include:

1. Introducing a regularization term for the training loss that establishes a direct connection between the generation probability of the target molecule and its similarity to the source molecule.
2. Training a new foundational molecular transformer model using a large-scale dataset.
3. Providing a method, based on beam search, for approximately exhaustive sampling of the high-probability neighborhood of a given source molecule.
4. Demonstrating the applicability of this framework to different datasets, similarity measures, and models.

With this approach, the researchers improve existing molecular transformer models so that target molecules are generated in an order that reflects both probability and similarity, which is especially important for molecular optimization tasks in drug discovery.
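The role of beam search as a deterministic sampling strategy can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_model` is a hypothetical stand-in for a trained transformer decoder, the three-token vocabulary is invented, and `$` is an assumed end-of-sequence marker.

```python
import math


def beam_search(next_log_probs, beam_size, max_len, eos="$"):
    """Return finished sequences sorted by descending total log-probability.

    next_log_probs(prefix) is assumed to return a dict mapping each
    candidate next token to its log-probability given the prefix.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every open prefix by every possible next token.
        candidates = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the beam_size highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


def toy_model(prefix):
    """Hypothetical next-token distribution (not a trained model)."""
    if len(prefix) == 0:
        probs = {"C": 0.7, "O": 0.3}
    else:
        probs = {"C": 0.2, "O": 0.2, "$": 0.6}
    return {token: math.log(p) for token, p in probs.items()}


results = beam_search(toy_model, beam_size=3, max_len=3)
for seq, score in results:
    print("".join(seq), round(math.exp(score), 3))
```

Because the search is deterministic, repeated runs return the same sequences in the same probability order. With a model whose probabilities track similarity, increasing `beam_size` would enumerate progressively more of the high-probability, and therefore near-neighbor, candidates.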

PrefixMol: Target- and Chemistry-aware Molecule Design Via Prefix Embedding

A Transformer-based Generative Model for De Novo Molecular Design

Evaluation of Reinforcement Learning in Transformer-based Molecular Design

3D-Transformer: Molecular Representation with Transformer in 3D Space

Generation of Dual-Target Compounds Using a Transformer Chemical Language Model

Probabilistic generative transformer language models for generative design of molecules

C5T5: Controllable Generation of Organic Molecules with Transformers

Transformers and Large Language Models for Chemistry and Drug Discovery

Generative Model for Small Molecules with Latent Space RL Fine-Tuning to Protein Targets

Repurformer: Transformers for Repurposing-Aware Molecule Generation

Molecule generation using transformers and policy gradient reinforcement learning

Generative Chemical Transformer: Neural Machine Learning of Molecular Geometric Structures from Chemical Language via Attention

Projecting Molecules into Synthesizable Chemical Spaces

cMolGPT: A Conditional Generative Pre-Trained Transformer for Target-Specific De Novo Molecular Generation

Sc2Mol: a Scaffold-Based Two-Step Molecule Generator with Variational Autoencoder and Transformer

Transformer Graph Variational Autoencoder for Generative Molecular Design

Molecular Transformer - A Model for Uncertainty-Calibrated Chemical Reaction Prediction

LLamol: A Dynamic Multi-Conditional Generative Transformer for De Novo Molecular Design

One Transformer Can Understand Both 2D & 3D Molecular Data

Optimization of binding affinities in chemical space with generative pre-trained transformer and deep reinforcement learning