GexMolGen: Cross-modal Generation of Hit-like Molecules via Large Language Model Encoding of Gene Expression Signatures

Jiabei Cheng, Xiaoyong Pan, Kaiyuan Yang, Shenghao Cao, Bin Liu, Qingran Yan, Ye Yuan
DOI: https://doi.org/10.1101/2023.11.11.566725
2024-02-19
Abstract: The design of custom drugs with specific biological activity is an extremely difficult task, but it holds the potential to generate molecules without first discovering target genes, moving beyond the modern paradigm of drug discovery. Traditional search methods rely heavily on existing recorded perturbation experiments, an approach that is costly and generalizes poorly. To overcome this limitation, we propose GexMolGen (Gene Expression-based Molecule Generator), a novel model that generates hit-like molecules using gene expression signatures derived from both the initial and desired states of gene expression. Our approach follows a “first-align-then-generate” strategy, in which we align the gene expression signatures and molecules within a mapping space, enabling a smooth transition from the former to the latter. The transformed molecular embeddings are then decoded to molecule graphs. In this framework, we employ an advanced single-cell large language model to allow flexible input for the genetic modality. We also pre-train a scaffold-based molecule model to improve efficiency by guaranteeing that all generated molecules are 100% valid. Empirical studies demonstrate that our model outperforms traditional search methods and offers unique advantages over existing deep learning methods. Overall, our model attempts to explore chemical and biological correlations in order to facilitate precision medicine. The usage code for GexMolGen has been released on .
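The alignment step of the “first-align-then-generate” strategy can be illustrated with a small sketch. The abstract does not state which alignment objective GexMolGen uses, so the code below assumes a CLIP-style symmetric contrastive loss over paired embeddings. The function names (`contrastive_alignment_loss`, `retrieve`), the temperature value, and the toy vectors are all hypothetical, and the decoding of embeddings into molecule graphs is not covered.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_alignment_loss(gene_emb, mol_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed objective, not the paper's).

    gene_emb[i] and mol_emb[i] are assumed to describe the same
    perturbation; every other pairing in the batch is a negative.
    """
    n = len(gene_emb)
    sim = [[cosine(g, m) / temperature for m in mol_emb] for g in gene_emb]

    def row_loss(logits, target):
        # Cross-entropy of one row, using a stable log-sum-exp.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        return log_z - logits[target]

    gene_to_mol = sum(row_loss(sim[i], i) for i in range(n)) / n
    mol_to_gene = sum(
        row_loss([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (gene_to_mol + mol_to_gene)

def retrieve(query, mol_emb):
    """Index of the molecule embedding closest to a signature embedding."""
    return max(range(len(mol_emb)), key=lambda j: cosine(query, mol_emb[j]))

# Toy 3-dimensional embeddings (hypothetical values, not from the paper).
genes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mols_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
mols_shuffled = [mols_aligned[1], mols_aligned[2], mols_aligned[0]]

aligned_loss = contrastive_alignment_loss(genes, mols_aligned)
shuffled_loss = contrastive_alignment_loss(genes, mols_shuffled)
```

In this toy setting the loss is lower when each signature embedding sits next to its paired molecule embedding than when the pairs are shuffled, which is the property that would let a signature embedding be mapped into the molecular space before decoding.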
Bioinformatics