ECloudGen: Leveraging Electron Clouds as a Latent Variable to Scale Up Structure-based Molecular Design
Odin Zhang, Jieyu Jin, Zhenxing Wu, Jintu Zhang, Po Yuan, Haitao Lin, Haiyang Zhong, Xujun Zhang, Chenqing Hua, Weibo Zhao, Zhengshuo Zhang, Kejun Ying, Yufei Huang, Huifeng Zhao, Yuntao Yu, Yu Kang, Peichen Pan, Jike Wang, Dong Guo, Shuangjia Zheng, Chang-Yu Hsieh, Tingjun Hou
DOI: https://doi.org/10.1101/2024.06.03.597263
2024-12-26
Abstract: Structure-based molecule generation represents a significant advancement in AI-aided drug design (AIDD). However, progress in this domain is constrained by the scarcity of structural data on protein-ligand complexes, a challenge we term the Paradox of Sparse Chemical Space Generation. To address this limitation, we propose a novel latent variable approach that bridges the data gap between ligand-only data and protein-ligand complexes, enabling target-aware generative models to explore a broader chemical space and improving the quality of molecular generation. Drawing inspiration from quantum molecular simulations, we introduce ECloudGen, a generative model that leverages electron clouds as meaningful latent variables, integrating physical principles into a deep learning framework. ECloudGen incorporates modern techniques, including latent diffusion models, Llama architectures, and a newly proposed contrastive learning task, which organizes the chemical space into a structured and highly interpretable latent representation. Benchmark studies demonstrate that ECloudGen outperforms state-of-the-art methods by generating more potent binders with superior physicochemical properties and by covering a significantly broader chemical space. The incorporation of electron clouds as latent variables not only improves generative performance but also introduces model-level interpretability, as illustrated in a case study designing V2R inhibitors. Furthermore, ECloudGen’s structurally ordered modeling of chemical space enables the development of a model-agnostic optimizer, extending its utility to molecular optimization tasks. This capability has been validated through a single-objective oracle benchmark and a complex multi-objective optimization scenario involving the redesign of endogenous BRD4 ligands.
In conclusion, ECloudGen effectively addresses the Paradox of Sparse Chemical Space Generation through its integration of theoretical insights, advanced generative techniques, and real-world validation. The newly proposed technique of leveraging physical entities (such as electron clouds) as latent variables within a deep learning framework may prove useful for computational biology fields beyond AIDD.
Biology