Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs

Xiuyuan Hu, Guoqing Liu, Yang Zhao, Hao Zhang
2024-01-15
Abstract: AI for drug discovery has been a research hotspot in recent years, and SMILES-based language models have been increasingly applied in drug molecular design. However, no work has explored whether and how language models understand the chemical spatial structure from 1D sequences. In this work, we pre-train a transformer model on chemical language and fine-tune it toward drug design objectives, and investigate the correspondence between high-frequency SMILES substrings and molecular fragments. The results indicate that language models can understand chemical structures from the perspective of molecular fragments, and the structural knowledge learned through fine-tuning is reflected in the high-frequency SMILES substrings generated by the model.
Machine Learning; Computational Engineering, Finance, and Science; Biomolecules
What problem does this paper attempt to address?
The paper aims to explore how large language models (LLMs) understand the spatial structure of drug molecules when processing chemical language and to verify whether these models can learn knowledge at the molecular fragment level through fine-tuning. Specifically, the authors used a pre-trained model based on the transformer architecture and performed reinforcement learning fine-tuning for drug design objectives. The research results indicate that the fine-tuned language model can not only generate valid SMILES sequences but also understand the spatial structure of drug molecules, rather than simply fitting the SMILES sequences. Additionally, by analyzing the changes in high-frequency SMILES substrings, the authors demonstrated that the language model indeed learned chemical knowledge related to drug design objectives, and this learning occurred at the molecular fragment level. These findings are significant for improving the interpretability and practicality of language models in drug design.
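The substring analysis described above can be illustrated with a minimal sketch. It is not the authors' code: the tokenizer regex, the fixed substring length `n`, and the helper names `tokenize`, `substring_counts`, and `frequency_shift` are all assumptions made for illustration. It counts token-level substrings in two sets of SMILES strings (for example, samples drawn before and after fine-tuning) and ranks substrings by how much their relative frequency increased.

```python
import re
from collections import Counter

# Simplified SMILES tokenizer (an assumption, not the paper's): bracket atoms,
# the two-letter halogens Cl and Br, two-digit ring closures, then single characters.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_RE.findall(smiles)


def substring_counts(smiles_list, n):
    """Count every run of n consecutive tokens across all SMILES strings."""
    counts = Counter()
    for smi in smiles_list:
        tokens = tokenize(smi)
        for i in range(len(tokens) - n + 1):
            counts["".join(tokens[i:i + n])] += 1
    return counts


def frequency_shift(before, after, n, top_k=5):
    """Rank substrings by the increase in relative frequency from before to after."""
    b = substring_counts(before, n)
    a = substring_counts(after, n)
    total_b = sum(b.values()) or 1  # avoid division by zero on empty input
    total_a = sum(a.values()) or 1
    shifts = {s: a[s] / total_a - b[s] / total_b for s in set(a) | set(b)}
    return sorted(shifts.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

A call such as `frequency_shift(pretrained_samples, finetuned_samples, n=4)` (both argument names hypothetical) returns the substrings whose share grew most. Relating those substrings to actual molecular fragments would additionally need a cheminformatics toolkit to parse them, which this sketch leaves out.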