GLDM: hit molecule generation with constrained graph latent diffusion model

Conghao Wang, Hiok Hian Ong, Shunsuke Chiba, Jagath C Rajapakse
DOI: https://doi.org/10.1093/bib/bbae142
IF: 9.5
2024-04-08
Briefings in Bioinformatics
Abstract: Discovering hit molecules with desired biological activity in a directed manner is a promising but profound task in computer-aided drug discovery. Inspired by recent generative AI approaches, particularly Diffusion Models (DM), we propose Graph Latent Diffusion Model (GLDM), a latent DM that preserves both the effectiveness of autoencoders of compressing complex chemical data and the DM's capabilities of generating novel molecules. Specifically, we first develop an autoencoder to encode the molecular data into low-dimensional latent representations and then train the DM on the latent space to generate molecules inducing targeted biological activity defined by gene expression profiles. Manipulating DM in the latent space rather than the input space avoids complicated operations to map molecule decomposition and reconstruction to diffusion processes, and thus improves training efficiency. Experiments show that GLDM not only achieves outstanding performances on molecular generation benchmarks, but also generates samples with optimal chemical properties and potentials to induce desired biological activity.
biochemical research methods, mathematical & computational biology
### What problem does this paper attempt to solve?

This paper addresses the targeted generation of hit molecules with specific biological activities in computer-aided drug discovery. Specifically, it proposes GLDM (Graph Latent Diffusion Model), a method for generating small-molecule drug candidates that can induce specific gene-expression changes.

#### Main problem background

1. **Limitations of traditional drug-discovery methods**:
   - Traditional computer-aided drug-discovery methods are costly and time-consuming because a large number of candidate molecules must be screened.
   - Most existing methods focus only on chemical validity and on optimizing the required chemical properties, while ignoring biological information such as gene-expression profiles.
2. **Deficiencies of existing generative models**:
   - Generative models based on the SMILES representation are concise, but they are prone to violating chemical rules during generation, which yields invalid molecules.
   - Molecular-generation models with graph representations can ensure chemical validity, but generating discrete structures is non-differentiable, which makes training harder.
3. **The need to incorporate biological information**:
   - Existing research mainly targets single targets or simple gene-expression patterns, and lacks methods for combining detailed transcriptomic features with molecular generation.

#### Solutions proposed in the paper

To solve these problems, the paper proposes GLDM, a graph-based latent diffusion model built from the following components:

- **Encoder and decoder**: A graph neural network (GNN) encodes the molecular graph into a low-dimensional latent representation, and a decoder reconstructs the molecular graph from it.
- **Diffusion model**: A diffusion model operates in the latent space. The forward process gradually adds noise to the latent representations; the model learns to reverse this process by denoising, and the denoised latents are decoded into new molecular graphs.
- **Conditional generation**: A multi-head cross-attention mechanism conditions the generation process on a given gene-expression profile, steering the generated molecules toward the expected biological activity.

With this design, GLDM is reported to perform well in unconditional generation and to significantly outperform traditional methods in conditional generation, especially when generating molecules with the potential to induce specific gene-expression changes.

#### Summary

The main objective of this paper is to develop an efficient and accurate deep generative model that generates hit molecules with specific biological activities in a targeted way, taking into account biological information such as gene-expression profiles, and thereby accelerates the drug-discovery process.
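To make the latent-space diffusion step concrete, the following is a minimal sketch of the standard DDPM forward (noising) process applied to a latent vector. It is not the authors' implementation: the linear noise schedule, the 1000-step horizon, the 4-dimensional latent `z0`, and all function names are assumptions chosen for illustration, and the summary above does not state which schedule or latent size GLDM uses. In a full system the latent would come from the GNN encoder, and a trained denoiser (conditioned on the gene-expression profile) would be asked to predict the noise `eps`.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM default, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_latent(z0, t, alpha_bars, rng):
    """Sample z_t from q(z_t | z_0) = N(sqrt(alpha_bar_t) * z_0, (1 - alpha_bar_t) * I).

    Returns the noised latent and the Gaussian noise that was used, which is
    the regression target for a noise-predicting denoiser.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [signal * z + sigma * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

z0 = [0.5, -1.2, 0.3, 2.0]  # hypothetical 4-d latent code standing in for an encoder output
z_early, _ = noise_latent(z0, 10, alpha_bars, rng)    # still close to z0
z_late, _ = noise_latent(z0, 999, alpha_bars, rng)    # almost pure Gaussian noise
```

Because the noising acts on a fixed-size continuous vector rather than on atoms and bonds, no graph-specific corruption rule is needed, which is the efficiency argument made in the abstract.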