SMI-Editor: Edit-based SMILES Language Model with Fragment-level Supervision

Kangjie Zheng, Siyue Liang, Junwei Yang, Bin Feng, Zequn Liu, Wei Ju, Zhiping Xiao, Ming Zhang
2024-12-07
Abstract: SMILES, a crucial textual representation of molecular structures, has garnered significant attention as a foundation for pre-trained language models (LMs). However, most existing pre-trained SMILES LMs focus solely on single-token-level supervision during pre-training, failing to fully leverage the substructural information of molecules. This limitation makes the pre-training task overly simplistic, preventing the models from capturing richer molecular semantic information. Moreover, during pre-training, these SMILES LMs only process corrupted SMILES inputs, never encountering any valid SMILES, which leads to a train-inference mismatch. To address these challenges, we propose SMI-Editor, a novel edit-based pre-trained SMILES LM. SMI-Editor disrupts substructures within a molecule at random and feeds the resulting SMILES back into the model, which then attempts to restore the original SMILES through an editing process. This approach not only introduces fragment-level training signals, but also enables the use of valid SMILES as inputs, allowing the model to learn how to reconstruct complete molecules from these incomplete structures. As a result, the model demonstrates improved scalability and an enhanced ability to capture fragment-level molecular information. Experimental results show that SMI-Editor achieves state-of-the-art performance across multiple downstream molecular tasks, and even outperforms several 3D molecular representation models.
Machine Learning, Biomolecules
What problem does this paper attempt to address?
This paper addresses two limitations of existing SMILES (Simplified Molecular Input Line Entry System) language models:

1. **Insufficient use of substructure information**: Existing pre-trained SMILES language models rely only on single-token-level supervision during pre-training and do not fully exploit the substructure information of molecules. This makes the pre-training task too simple and leaves the model unable to capture richer molecular semantic information.
2. **Mismatch between training and inference**: These models only process corrupted SMILES inputs during pre-training and never encounter valid SMILES, so the inputs seen in training differ from those seen at inference.

To solve these problems, the authors propose **SMI-Editor**, an edit-based pre-trained SMILES language model. It improves on earlier models in two ways:

- **Fragment-level supervision**: Substructures in a molecule are disrupted at random, and the resulting SMILES is fed back to the model, which must restore the original SMILES through an editing process. This introduces a fragment-level training signal and allows valid SMILES to be used as input, so the model learns to reconstruct complete molecules from incomplete structures.
- **Edit-based pre-training objective**: The pre-training objective is formulated in terms of edit operations, which lets the model process valid SMILES sequences and restore the missing substructures by editing.

With these changes, SMI-Editor shows better scalability and a stronger ability to capture fragment-level molecular information. Experimental results show that SMI-Editor achieves state-of-the-art performance on multiple downstream molecular tasks and even outperforms some 3D molecular representation models.
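The corrupt-then-restore idea described above can be illustrated with a toy sketch. It is not the authors' implementation: the character-level tokenization, the hand-picked fragment removal, and the helper names `expert_edits` and `apply_edits` are all assumptions made for illustration, and `difflib.SequenceMatcher` is used as a stand-in aligner that is not guaranteed to give a Levenshtein-optimal edit script.

```python
from difflib import SequenceMatcher


def expert_edits(corrupted, target):
    """Derive delete flags and insertions that turn `corrupted` into `target`.

    delete_flags[i] is 1 if corrupted[i] should be deleted, else 0.
    insertions maps a slot index (number of kept tokens to its left)
    to the list of tokens that must be inserted at that slot.
    """
    matcher = SequenceMatcher(a=corrupted, b=target, autojunk=False)
    delete_flags = [1] * len(corrupted)
    insertions = {}
    kept = 0  # number of tokens kept so far
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for i in range(i1, i2):
                delete_flags[i] = 0
            kept += i2 - i1
        elif j2 > j1:
            # "insert" or "replace": target tokens go in the current slot
            insertions.setdefault(kept, []).extend(target[j1:j2])
    return delete_flags, insertions


def apply_edits(corrupted, delete_flags, insertions):
    """Apply deletions first, then fill each slot with its inserted tokens."""
    kept = [tok for tok, d in zip(corrupted, delete_flags) if not d]
    restored = []
    for slot in range(len(kept) + 1):
        restored.extend(insertions.get(slot, []))
        if slot < len(kept):
            restored.append(kept[slot])
    return restored


# Toy example (assumption: one character per token; real SMILES tokenizers
# treat atoms such as "Cl" or bracketed atoms as single tokens).
target = list("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
corrupted = list("CC(=O)Oc1ccccc1")      # still valid SMILES, one fragment missing

delete_flags, insertions = expert_edits(corrupted, target)
restored = apply_edits(corrupted, delete_flags, insertions)
print("".join(restored))
```

Here the input with the carboxylic acid fragment removed is itself a valid SMILES string, which is the property the summary highlights: the model sees well-formed molecules and has to supply the missing fragment, not just individual masked tokens.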
### Formula presentation

The main formulas from the paper are given below in Markdown format:

1. **Probability of the deletion operation**:

   \[ \pi_{\text{del}}^\theta(i) = \text{Softmax}(W_d^T x_E^i) \]

   where \( W_d \) is a weight matrix of size \( H \times 2 \), and \( H \) is the hidden layer size.

2. **Prediction of insertion position and quantity**:

   \[ \pi_{\text{ins}}^\theta(i) = \text{Softmax}(W_{\text{in}}^T x_E^i) \]

   where \( W_{\text{in}} \) is a weight matrix of size \( H \times 256 \).

3. **Prediction of the specific tokens to insert**:

   \[ \pi_{\text{tok}}^\theta(i) = \text{Softmax}(W_{\text{tok}}^T x_E^i) \]

   where \( W_{\text{tok}} \) is a weight matrix of size \( H \times V \), and \( V \) is the vocabulary size.

4. **Dual deletion loss**:

   \[ L_{\text{DualDel}}^\theta = -\sum_{y_i \in \hat{M}} \log \pi_{\text{del}}^\theta(d^*_i \mid i, \hat{M}) \]

   where \( d^* \) is the optimal deletion action determined by the expert to minimize the Levenshtein distance to the target output \( y^* \), that is, the SMILES of molecule \( M \).

Through these improvements, SMI-Editor learns the substructure semantics of molecules more effectively and achieves excellent performance on multiple molecular property prediction tasks.
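The formulas above can be made concrete with a minimal pure-Python sketch. It assumes random toy weights and hidden states, a toy hidden size `H = 8` and vocabulary size `V = 12`, and hypothetical helper names (`head_probs`, `dual_deletion_loss`); it only shows the shapes and the negative log-likelihood form of the loss and is not the paper's implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def head_probs(W, x):
    """Softmax(W^T x) for W of shape H x K (list of H rows) and x of length H."""
    num_classes = len(W[0])
    logits = [sum(W[h][k] * x[h] for h in range(len(x))) for k in range(num_classes)]
    return softmax(logits)


def dual_deletion_loss(W_d, states, expert_actions):
    """Negative log-likelihood of expert deletion actions (0 = keep, 1 = delete)."""
    loss = 0.0
    for x, d_star in zip(states, expert_actions):
        loss -= math.log(head_probs(W_d, x)[d_star])
    return loss


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
H, V = 8, 12                      # toy sizes (assumptions)
W_d = rand_matrix(H, 2, rng)      # deletion head: keep / delete
W_in = rand_matrix(H, 256, rng)   # insertion head: 256 classes
W_tok = rand_matrix(H, V, rng)    # token head: one class per vocabulary entry

# Stand-ins for the hidden states x_E^i of a 5-token sequence
states = [[rng.gauss(0.0, 1.0) for _ in range(H)] for _ in range(5)]
actions = [0, 1, 0, 0, 1]         # hypothetical expert deletion labels d*_i

p_del = head_probs(W_d, states[0])
p_ins = head_probs(W_in, states[0])
p_tok = head_probs(W_tok, states[0])
loss = dual_deletion_loss(W_d, states, actions)
print(len(p_del), len(p_ins), len(p_tok), loss)
```

Each head is the same operation with a different output width: 2 classes for deletion, 256 for the insertion prediction, and `V` for token prediction. The loss simply sums the negative log-probability that the deletion head assigns to the expert's action at every position.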