Abstract: Interest in leveraging sequence-based large language models (LLMs) for drug design has grown significantly in recent years. However, most current applications of LLMs in drug discovery cannot comprehend three-dimensional (3D) structures, which limits their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduce Token-Mol, a token-only 3D drug design model. The model encodes all molecular information, including 2D and 3D structures as well as molecular property data, into tokens. This encoding transforms classification and regression tasks in drug discovery into probabilistic prediction problems, so that all of them can be learned through a unified paradigm. Token-Mol is built on the transformer decoder architecture and trained using random causal masking techniques. Additionally, we propose the Gaussian cross-entropy (GCE) loss function to overcome the challenges of regression tasks, significantly enhancing the capacity of LLMs to learn continuous numerical values. Through a combination of fine-tuning and reinforcement learning (RL), Token-Mol achieves performance comparable to or surpassing existing task-specific methods across various downstream tasks, including pocket-based molecular generation, conformation generation, and molecular property prediction. Compared to existing molecular pre-trained models, Token-Mol handles a wider range of the downstream tasks essential for drug design. Notably, our approach improves regression task accuracy by approximately 30% compared to similar token-only methods. Token-Mol overcomes the precision limitations of token-only models and has the potential to integrate seamlessly with general models such as ChatGPT, paving the way for a universal artificial intelligence drug design model that helps experts design drugs rapidly and with high quality.
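
The abstract names the Gaussian cross-entropy (GCE) loss but does not define it. The sketch below illustrates one plausible reading, assuming that a continuous value is discretized into ordered numeric tokens (bins) and that the one-hot training target is replaced by a Gaussian centered on the true bin. The function names, the bin count, and the `sigma` parameter are all hypothetical and are not taken from the paper.

```python
import math


def gaussian_soft_targets(true_bin, num_bins, sigma=1.0):
    """Gaussian weights over bin indices, centered on true_bin and normalized to sum to 1."""
    weights = [math.exp(-0.5 * ((i - true_bin) / sigma) ** 2) for i in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def one_hot_cross_entropy(logits, true_bin):
    """Standard cross-entropy: only the probability of the exact bin matters."""
    return -log_softmax(logits)[true_bin]


def gaussian_cross_entropy(logits, true_bin, sigma=1.0):
    """Cross-entropy against a Gaussian-smoothed target (assumed form of GCE)."""
    targets = gaussian_soft_targets(true_bin, len(logits), sigma)
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


# Two hypothetical predictions for a value whose true bin is 5:
# one peaks on the neighboring bin 6, the other on the distant bin 0.
num_bins, true_bin = 10, 5
near_miss = [0.0] * num_bins
near_miss[6] = 4.0
far_miss = [0.0] * num_bins
far_miss[0] = 4.0

ce_near = one_hot_cross_entropy(near_miss, true_bin)
ce_far = one_hot_cross_entropy(far_miss, true_bin)
gce_near = gaussian_cross_entropy(near_miss, true_bin)
gce_far = gaussian_cross_entropy(far_miss, true_bin)
```

Under these assumptions, the one-hot loss assigns the same penalty to both predictions, while the Gaussian-smoothed loss penalizes the distant miss more than the near miss. That ordering is the property a token-based model needs in order to treat numeric tokens as points on a scale rather than as unrelated classes.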

Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs

Fragment and Geometry Aware Tokenization of Molecules for Structure-Based Drug Design Using Language Models

Large-scale chemical language representations capture molecular structure and properties

Unlocking comprehensive molecular design across all scenarios with large language model and unordered chemical language

Language models in molecular discovery

Melting point prediction of organic molecules by deciphering the chemical structure into a natural language

Large language model for molecular chemistry

Infusing Linguistic Knowledge of SMILES into Chemical Language Models

Molecular fragmentation as a crucial step in the AI-based drug development pathway

Pre-trained Molecular Language Models with Random Functional Group Masking

DrugLLM: Open Large Language Model for Few-shot Molecule Generation

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

Chemical Language Models for Molecular Design

Large Language Models as Molecular Design Engines

FraHMT: A Fragment-Oriented Heterogeneous Graph Molecular Generation Model for Target Proteins

SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery

Molecular language models: RNNs or transformer?

Can Large Language Models Understand Molecules?

Token-Mol 1.0: Tokenized drug design with large language model