Text-Guided Multi-Property Molecular Optimization with a Diffusion Language Model

Yida Xiong, Kun Li, Weiwei Liu, Jia Wu, Bo Du, Shirui Pan, Wenbin Hu
2024-10-17
Abstract: Molecular optimization (MO) is a crucial stage in drug discovery in which task-oriented generated molecules are optimized to meet practical industrial requirements. Existing mainstream MO approaches primarily utilize external property predictors to guide iterative property optimization. However, learning all molecular samples in the vast chemical space is unrealistic for predictors. As a result, errors and noise are inevitably introduced during property prediction due to the nature of approximation. This leads to discrepancy accumulation, generalization reduction and suboptimal molecular candidates. In this paper, we propose a text-guided multi-property molecular optimization method utilizing a transformer-based diffusion language model (TransDLM). TransDLM leverages standardized chemical nomenclature as semantic representations of molecules and implicitly embeds property requirements into textual descriptions, thereby preventing error propagation during the diffusion process. Guided by physically and chemically detailed textual descriptions, TransDLM samples and optimizes encoded source molecules, retaining the core scaffolds of the source molecules and ensuring structural similarity. Moreover, TransDLM enables simultaneous sampling of multiple molecules, making it well suited to scalable, efficient large-scale optimization through distributed computation on web platforms. Furthermore, our approach surpasses state-of-the-art methods in optimizing molecular structural similarity and enhancing chemical properties on the benchmark dataset. The code is available at: https://anonymous.4open.science/r/TransDLM-A901
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to effectively optimize generated molecules so that they meet practical industrial requirements in drug discovery. Existing mainstream molecular optimization methods rely mainly on external property predictors to guide iterative property optimization. However, because the chemical space is vast and complex, a predictor cannot realistically learn all molecular samples, so property prediction inevitably introduces errors and noise. These errors accumulate over the optimization iterations, degrade the quality of the results and lead to suboptimal molecular candidates. In addition, traditional molecular optimization relies mainly on chemists' experience, knowledge and intuition, which makes the process time-consuming and makes it difficult to find an ideal molecule within a limited time. To address these challenges, the paper proposes a text-guided multi-property molecular optimization method based on a transformer-based diffusion language model (TransDLM). TransDLM uses standardized chemical nomenclature as the semantic representation of molecules and implicitly embeds property requirements into textual descriptions, thereby preventing error propagation during the diffusion process. Guided by physically and chemically detailed textual descriptions, TransDLM samples and optimizes the encoded source molecules while retaining their core scaffolds and ensuring structural similarity. Moreover, TransDLM supports sampling multiple molecules simultaneously, which makes it suitable for scalable, efficient large-scale optimization through distributed computation on web platforms. Experimental results show that TransDLM outperforms state-of-the-art methods in optimizing molecular structural similarity and enhancing chemical properties on the benchmark dataset.
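
To make the idea of embedding property requirements into a textual description more concrete, the following is a minimal sketch of how such a guidance text might be assembled from a molecule's standardized name and a set of property goals. It is an illustration only: `PropertyGoal`, `build_description`, the sentence template and the property names are all hypothetical and are not taken from the paper or its code, whose actual description format is not given in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PropertyGoal:
    """A desired change in one property (hypothetical structure)."""
    name: str       # placeholder property name, e.g. "solubility"
    direction: str  # "higher" or "lower"


def build_description(iupac_name: str, goals: list[PropertyGoal]) -> str:
    """Compose a guidance text from a molecule name and property goals.

    The sentence template is an assumption made for illustration; it is
    not the description format used by TransDLM.
    """
    if not goals:
        raise ValueError("at least one property goal is required")
    for goal in goals:
        if goal.direction not in ("higher", "lower"):
            raise ValueError(f"unknown direction: {goal.direction!r}")

    # Turn each goal into a short clause such as "higher solubility".
    clauses = [f"{goal.direction} {goal.name}" for goal in goals]
    if len(clauses) == 1:
        wanted = clauses[0]
    else:
        wanted = ", ".join(clauses[:-1]) + " and " + clauses[-1]

    return f"The molecule is an analog of {iupac_name} with {wanted}."


if __name__ == "__main__":
    text = build_description(
        "2-acetyloxybenzoic acid",
        [PropertyGoal("solubility", "higher"), PropertyGoal("clearance", "lower")],
    )
    print(text)
```

In this sketch the property requirements live entirely in the text, so no external property predictor is called while the description is built. How TransDLM conditions the diffusion process on such a description is beyond what the abstract specifies and is not modeled here.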