Abstract: SMILES, a crucial textual representation of molecular structures, has garnered significant attention as a foundation for pre-trained language models (LMs). However, most existing pre-trained SMILES LMs focus solely on single-token-level supervision during pre-training, failing to fully leverage the substructural information of molecules. This limitation makes the pre-training task overly simplistic, preventing the models from capturing richer molecular semantic information. Moreover, during pre-training, these SMILES LMs only process corrupted SMILES inputs, never encountering any valid SMILES, which leads to a train-inference mismatch. To address these challenges, we propose SMI-Editor, a novel edit-based pre-trained SMILES LM. SMI-Editor disrupts substructures within a molecule at random and feeds the resulting SMILES back into the model, which then attempts to restore the original SMILES through an editing process. This approach not only introduces fragment-level training signals, but also enables the use of valid SMILES as inputs, allowing the model to learn how to reconstruct complete molecules from these incomplete structures. As a result, the model demonstrates improved scalability and an enhanced ability to capture fragment-level molecular information. Experimental results show that SMI-Editor achieves state-of-the-art performance across multiple downstream molecular tasks, and even outperforms several 3D molecular representation models.

What problem does this paper attempt to address?

This paper addresses two limitations of existing SMILES (Simplified Molecular-Input Line-Entry System) language models:

1. **Insufficient utilization of substructure information**: Existing pre-trained SMILES language models rely only on single-token-level supervision during pre-training and fail to fully utilize the substructure information of molecules. This makes the pre-training task too simple and leaves the model unable to capture richer molecular semantic information.
2. **Mismatch between training and inference**: These SMILES language models only process corrupted SMILES inputs during pre-training and never encounter valid SMILES, which creates an inconsistency between training and inference.

To solve these problems, the authors propose an edit-based pre-trained SMILES language model, **SMI-Editor**. The model improves on prior work in the following ways:

- **Introducing fragment-level supervision**: Substructures in a molecule are randomly destroyed and the resulting SMILES is fed back to the model, which must restore the original SMILES through an editing process. This not only introduces a fragment-level training signal but also allows valid SMILES to be used as input, enabling the model to learn how to reconstruct complete molecules from incomplete structures.
- **Edit-based pre-training objective**: An edit-based pre-training objective is designed so that the model processes valid SMILES sequences and restores missing substructures through an editing process.

With these improvements, SMI-Editor shows better scalability and a stronger ability to capture fragment-level molecular information. Experimental results show that SMI-Editor achieves state-of-the-art performance on multiple downstream molecular tasks and even outperforms some 3D molecular representation models.
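The corrupt-then-restore idea described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the authors' pipeline: the regex tokenizer is simplified, the removed fragment (the acetyl group of aspirin) is picked by hand instead of by a fragmentation algorithm, and the helper names `tokenize`, `insertion_plan`, and `apply_insertions` are hypothetical. The sketch only shows that a still-valid, fragment-depleted SMILES can be restored with insertion edits.

```python
import re

# Simplified SMILES tokenizer (assumption): bracket atoms, Cl/Br,
# two-digit ring closures, otherwise one character per token.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    return TOKEN_RE.findall(smiles)


def insertion_plan(corrupted, original):
    """Return, for each slot k, the tokens to insert before corrupted[k].

    Slot len(corrupted) is the end of the sequence. Assumes `corrupted`
    is a subsequence of `original` (tokens were only removed).
    """
    plan = [[] for _ in range(len(corrupted) + 1)]
    j = 0
    for tok in original:
        if j < len(corrupted) and tok == corrupted[j]:
            j += 1                 # token survived the corruption
        else:
            plan[j].append(tok)    # token must be re-inserted at slot j
    if j != len(corrupted):
        raise ValueError("corrupted is not a subsequence of original")
    return plan


def apply_insertions(corrupted, plan):
    """Apply an insertion plan and return the restored token list."""
    out = []
    for k, tok in enumerate(corrupted):
        out.extend(plan[k])
        out.append(tok)
    out.extend(plan[len(corrupted)])
    return out


original = tokenize("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
corrupted = tokenize("Oc1ccccc1C(=O)O")        # acetyl group removed by hand
plan = insertion_plan(corrupted, original)
restored = "".join(apply_insertions(corrupted, plan))
```

The greedy alignment is purely token-based, so the slots it chooses need not coincide with the chemically removed fragment; it only guarantees that applying the plan reproduces the original token sequence.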
### Formula presentation

To ensure the correctness and readability of the formulas, some of the formulas involved in the paper are presented below in Markdown format:

1. **Probability of deletion operation**:

   \[ \pi_{\text{del}}^\theta(i) = \text{Softmax}(W_d^T x_E^i) \]

   where \( W_d \) is a weight matrix of size \( H \times 2 \), and \( H \) is the hidden layer size.

2. **Prediction of insertion operation position and quantity**:

   \[ \pi_{\text{ins}}^\theta(i) = \text{Softmax}(W_{\text{in}}^T x_E^i) \]

   where \( W_{\text{in}} \) is a weight matrix of size \( H \times 256 \).

3. **Prediction of specific tokens for insertion operation**:

   \[ \pi_{\text{tok}}^\theta(i) = \text{Softmax}(W_{\text{tok}}^T x_E^i) \]

   where \( W_{\text{tok}} \) is a weight matrix of size \( H \times V \), and \( V \) represents the size of the vocabulary.

4. **Dual deletion loss**:

   \[ L_{\text{DualDel}}^\theta = -\sum_{y_i \in \hat{M}} \log \pi_{\text{del}}^\theta(d^*_i \mid i, \hat{M}) \]

   where \( d^* \) is the optimal deletion action determined by experts to minimize the Levenshtein distance to the target output \( y^* \), that is, the SMILES of molecule \( M \).

Through these improvements, the SMI-Editor model can learn the substructure semantic information of molecules more effectively and has achieved excellent performance in multiple molecular property prediction tasks.
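The deletion head and the dual deletion loss can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `head` computes \( \text{Softmax}(W^T x) \) for any of the three weight matrices, `expert_deletions` is a hypothetical stand-in for the expert that derives \( d^* \) by keeping a longest common subsequence with the target (which is optimal when only insertions and deletions are allowed), and the encoder states \( x_E^i \) are replaced by random vectors.

```python
import math
import random


def softmax(z):
    m = max(z)                                  # subtract max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def head(W, x):
    """Softmax(W^T x) for W given as H rows of K columns and x of length H."""
    K = len(W[0])
    logits = [sum(W[h][k] * x[h] for h in range(len(x))) for k in range(K)]
    return softmax(logits)


def expert_deletions(seq, target):
    """Labels d*_i (1 = delete, 0 = keep) that preserve an LCS with target."""
    n, m = len(seq), len(target)
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if seq[i] == target[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    labels, i, j = [], 0, 0
    while i < n:
        if j < m and seq[i] == target[j]:
            labels.append(0)                    # token is kept
            i, j = i + 1, j + 1
        elif j < m and lcs[i][j + 1] >= lcs[i + 1][j]:
            j += 1                              # skip a target token
        else:
            labels.append(1)                    # token is deleted
            i += 1
    return labels


def dual_del_loss(W_d, states, labels):
    """Negative log-likelihood of the expert deletion labels."""
    return -sum(math.log(head(W_d, x)[d]) for x, d in zip(states, labels))


random.seed(0)
H = 8                                            # hidden size (assumed)
tokens = list("CC(=O)NO")                        # toy sequence with a stray N
target = list("CC(=O)O")                         # toy target y*
d_star = expert_deletions(tokens, target)
W_d = [[random.gauss(0, 0.1) for _ in range(2)] for _ in range(H)]
states = [[random.gauss(0, 1) for _ in range(H)] for _ in tokens]
loss = dual_del_loss(W_d, states, d_star)
```

The same `head` function covers the insertion-count and token heads by passing a matrix with 256 or \( V \) columns instead of 2.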

SMI-Editor: Edit-based SMILES Language Model with Fragment-level Supervision

Improve retrosynthesis planning with a molecular editing language

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

MolMetaLM: a Physicochemical Knowledge-Guided Molecular Meta Language Model

t-SMILES: A Scalable Fragment-based Molecular Representation Framework for De Novo Molecule Generation

t-SMILES: a fragment-based molecular representation framework for de novo ligand design

PromptSMILES: Prompting for scaffold decoration and fragment linking in chemical language models

Fragment and Geometry Aware Tokenization of Molecules for Structure-Based Drug Design Using Language Models

IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System

Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model

3D2SMILES: Translating Physical Molecular Models into Digital DeepSMILES Notations Using Deep Learning

Infusing Linguistic Knowledge of SMILES into Chemical Language Models

Versatile Molecular Editing via Multimodal and Group-optimized Generative Learning

Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs

Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing

MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts

Towards 3D Molecule-Text Interpretation in Language Models

MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

Learning to SMILES: BAN-based strategies to improve latent representation learning from molecules

MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations

Pre-trained Molecular Language Models with Random Functional Group Masking