Exhaustive local chemical space exploration using a transformer model

Alessandro Tibo, Jiazhen He, Jon Paul Janet, Eva Nittinger, Ola Engkvist
DOI: https://doi.org/10.26434/chemrxiv-2023-v25xb-v2
2024-05-22
Abstract: How many near-neighbors does a molecule have? This is a simple, fundamental, but unsolved question in chemistry. It is key for solving many important molecular optimization problems, for example in lead optimization in drug discovery under the similarity principle assumption. Generative models can sample virtual molecules from a vast theoretical chemical space, but so far have lacked explicit knowledge about molecular similarity. This means that a generative model needs to be guided by reinforcement learning or another learning mechanism to be able to sample a relevant similar chemical space. Correspondingly, the generative model provides no mechanism for quantifying how completely it can sample a particular region of the chemical space. To overcome these limitations, a novel source-target molecular transformer model is proposed, regularized via a similarity kernel function. It has been trained on, to the best of our knowledge, the largest data set of molecular pairs so far, consisting of ≥ 200 billion pairs. The regularization term enforces a direct relationship between the probability of generating a target molecule and its similarity to a given source molecule. The model is able to systematically sample compounds ordered by their probability and accordingly by their similarity. In combination with a deterministic sampling strategy, beam search, it is possible for the first time to comprehensively explore the near-neighborhood around a specific compound. Our results show that the regularization term helps to substantially improve the correlation between the probability of generating a target molecule and its similarity to the source molecule. The trained transformer model is able to exhaustively sample a near-neighborhood around a given drug-like molecule.
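The abstract does not spell out the form of the regularization term. One plausible way to tie generation probability to similarity is a pairwise ranking penalty: among targets of the same source, a pair is penalized whenever the more similar target receives the higher negative log-likelihood (NLL). The sketch below is an illustration under that assumption and is not taken from the paper; `tanimoto`, `ranking_penalty`, the choice of Tanimoto similarity as the kernel, and the fingerprint bit sets are all hypothetical.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ranking_penalty(nlls, sims):
    """Hypothetical pairwise ranking regularizer (an assumption, not the paper's loss).

    nlls[i] is the model's negative log-likelihood of target i given the source.
    sims[i] is the similarity of target i to the source.
    A pair contributes a penalty when the more similar target has the larger
    NLL, i.e. the lower generation probability.
    """
    total, n_pairs = 0.0, 0
    for i, j in combinations(range(len(nlls)), 2):
        if sims[i] == sims[j]:
            continue  # ties impose no ordering
        more, less = (i, j) if sims[i] > sims[j] else (j, i)
        total += max(0.0, nlls[more] - nlls[less])
        n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


# Toy data: the fingerprint bit sets are made up for illustration.
source = {1, 2, 3, 4}
targets = [{1, 2, 3, 4, 5}, {1, 2, 7, 8}, {9, 10}]
sims = [tanimoto(source, t) for t in targets]

# NLL grows as similarity falls: ordering agrees with similarity.
consistent = ranking_penalty([0.5, 2.0, 6.0], sims)
# NLL shrinks as similarity falls: ordering contradicts similarity.
inconsistent = ranking_penalty([6.0, 2.0, 0.5], sims)
```

In a training setup, a term of this kind would typically be added to the usual NLL loss with a weighting factor, so that minimizing the total loss pushes the probability ranking toward the similarity ranking.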
Chemistry
What problem does this paper attempt to address?
This paper addresses local search in chemical space: how to effectively explore the neighboring compounds that are similar to a specific molecule. Generative models can sample virtual molecules from the vast theoretical chemical space, but they lack explicit knowledge of molecular similarity, so methods such as reinforcement learning are needed to guide sampling toward the relevant region of chemical space. To address this, the researchers propose a novel source-target molecular transformer model that is regularized using a similarity kernel function. The model is trained on the largest molecular pair dataset to date (at least 200 billion pairs), and the regularization term ensures that the probability of generating a target molecule is directly related to its similarity to the given source molecule. Combined with a deterministic sampling strategy, beam search, the model can systematically and comprehensively explore the neighborhood of a specific compound. The main contributions of the paper are:
1. Introducing a regularization term for the training loss that establishes a direct connection between the generation probability of a target molecule and its similarity to the source molecule.
2. Training a new foundational molecular transformer model on a large-scale dataset.
3. Providing a method for approximately exhaustive sampling of the high-probability neighborhood of a given source molecule using beam search.
4. Demonstrating the applicability of this framework to different datasets, similarity measures, and models.
With this approach, existing molecular transformer models can be improved so that the probability of generating a target molecule reflects its similarity to the source, which is especially important for molecular optimization tasks in drug discovery.
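To make the role of beam search concrete, the following is a minimal sketch of deterministic beam search over a toy autoregressive model. It is not the authors' implementation: `next_token_logprobs` is a hypothetical stand-in for a trained transformer decoder conditioned on a source molecule, and its vocabulary and probabilities are invented. The point it illustrates is that beam search returns sequences ordered by probability, and that a larger beam recovers more of the total probability mass.

```python
import math

EOS = "$"  # end-of-sequence marker (hypothetical)


def next_token_logprobs(prefix: str) -> dict:
    """Toy next-token distribution; stands in for a trained decoder."""
    if len(prefix) >= 3:
        probs = {EOS: 1.0}  # force termination after three tokens
    elif not prefix:
        probs = {"C": 0.6, "N": 0.3, "O": 0.1}
    else:
        probs = {"C": 0.5, "N": 0.2, "O": 0.2, EOS: 0.1}
    return {tok: math.log(p) for tok, p in probs.items()}


def beam_search(beam_size: int, max_len: int = 10):
    """Return finished sequences as (sequence, log-probability), best first."""
    beams = [("", 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_token_logprobs(prefix).items():
                if token == EOS:
                    finished.append((prefix, score + logp))
                else:
                    candidates.append((prefix + token, score + logp))
        if not candidates:
            break
        # Keep only the beam_size most probable unfinished prefixes.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


def covered_mass(results) -> float:
    """Total probability of the sequences that were returned."""
    return sum(math.exp(score) for _, score in results)


narrow = beam_search(beam_size=2)
wide = beam_search(beam_size=100)  # wide enough to enumerate the toy space
```

In this toy setting the wide beam enumerates every sequence the model can produce, so `covered_mass(wide)` sums to 1, while the narrow beam recovers only part of the mass. For a real model the space cannot be enumerated, which is why the summary describes the sampling as approximate.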