A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence

Xiaofan Zheng, Yoichi Tomiura
DOI: https://doi.org/10.1186/s13321-024-00848-7
2024-06-21
Journal of Cheminformatics
Abstract: Among the various molecular properties and their combinations, it is a costly process to obtain the desired molecular properties through theory or experiment. Using machine learning to analyze molecular structure features and to predict molecular properties is a potentially efficient alternative for accelerating the prediction of molecular properties. In this study, we analyze molecular properties through the molecular structure from the perspective of machine learning. We use SMILES sequences as inputs to an artificial neural network in extracting molecular structural features and predicting molecular properties. A SMILES sequence comprises symbols representing molecular structures. To address the problem that a SMILES sequence is different from actual molecular structural data, we propose a pretraining model for a SMILES sequence based on the BERT model, which is widely used in natural language processing, such that the model learns to extract the molecular structural information contained in the SMILES sequence. In an experiment, we first pretrain the proposed model with 100,000 SMILES sequences and then use the pretrained model to predict molecular properties on 22 data sets and the odor characteristics of molecules (98 types of odor descriptor). The experimental results show that our proposed pretraining model effectively improves the performance of molecular property prediction.
chemistry, multidisciplinary, computer science, interdisciplinary applications, information systems
What problem does this paper attempt to address?
This paper asks how to extract molecular structural information from SMILES sequences efficiently with machine learning and use it to predict molecular properties. Specifically, the authors propose a BERT-based pretraining model that addresses the shortcomings of existing natural language processing models when they are applied to SMILES sequences, so that molecular properties and odor characteristics can be predicted more effectively. The main contributions of the paper are:

1. **A new pretraining model**: The model is designed around the characteristics of SMILES sequences, in particular the low dependence of symbols on their context and the fact that one compound can correspond to multiple SMILES sequences. This 2-encoder pretraining model shows higher robustness in molecular property prediction tasks.
2. **Molecular property prediction**: By fine-tuning the pretrained model, the authors evaluate it on 22 datasets covering a variety of molecular properties, such as absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties.
3. **Odor characteristic prediction**: The authors also examine the prediction of molecular odor characteristics, a task that has received less attention in previous studies, using a dataset containing 98 odor descriptors.

### Paper Background

Molecules are the microscopic units that make up macroscopic substances, and their properties directly affect how substances are used in daily life. Both simple properties such as hydrophilicity and complex properties such as protein-binding ability are affected by the internal structure of molecules. However, obtaining these molecular properties through experimental or computational chemistry methods is costly and time-consuming.
Therefore, using machine learning to predict molecular properties has become an efficient alternative.

### Methods and Experiments

#### Methods

- **SMILES sequences as input**: Unlike graph data and 3D geometric structures, SMILES sequences do not express the relationships between atoms directly, but the model can learn the implicit structural information through unsupervised learning.
- **2-encoder pretraining model**: The model consists of two encoders. The first encoder takes the standard SMILES sequence and a special character 'cls' as input, and its output is regarded as the molecular embedding. The second encoder takes a randomly masked SMILES sequence as input and outputs the recovered SMILES sequence. In this way, the model can recover the masked SMILES sequences more accurately and thereby obtain more complete molecular structure information.

#### Experiments

- **Pretraining stage**: Pretrain on 100,000 SMILES sequences and compare the performance of the BERT MLM model and the 2-encoder model.
- **Molecular property prediction**: Evaluate the prediction performance of the model on 22 ADMET property datasets.
- **Odor characteristic prediction**: Evaluate the performance of the model on a dataset containing 98 odor descriptors.

### Results and Discussions

- **Pretraining results**: The 2-encoder model recovers masked symbols significantly more accurately than the BERT MLM model, even at higher masking rates.
- **Molecular property prediction results**: The 2-encoder model achieved the best results on 14 datasets, the BERT MLM model on 6 datasets, and the model without pretraining on 2 datasets.
- **Odor characteristic prediction results**: The 2-encoder model also performs well on the odor characteristic prediction task and is significantly better than the other models on some datasets.
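
The masked-recovery setup described under Methods, and the symbol-recovery accuracy reported under Results, can be illustrated with a small data-preparation sketch. It is not the authors' implementation and covers only the input side, not the encoders themselves. The tokenization regex, the `[MASK]` symbol, the 15% masking rate, and the helper names `tokenize`, `mask_tokens`, and `recovery_accuracy` are all assumptions made for illustration.

```python
import random
import re

# Assumed tokenization: bracket atoms, two-letter halogens, two-digit ring
# closures, then any remaining single character. The paper's vocabulary may differ.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")
MASK = "[MASK]"  # hypothetical mask symbol


def tokenize(smiles):
    """Split a SMILES string into a list of symbol tokens."""
    return TOKEN_RE.findall(smiles)


def mask_tokens(tokens, mask_rate, rng):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and the sorted masked positions.
    Assumes a non-empty token list; at least one token is always masked.
    """
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions


def recovery_accuracy(original, predicted, positions):
    """Fraction of masked positions where the prediction matches the original."""
    hits = sum(original[i] == predicted[i] for i in positions)
    return hits / len(positions)


smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
tokens = tokenize(smiles)
masked, positions = mask_tokens(tokens, mask_rate=0.15, rng=random.Random(0))
print(masked)
print(recovery_accuracy(tokens, tokens, positions))  # a perfect prediction
```

In a real pipeline, a trained model's predictions would take the place of `tokens` in the second argument to `recovery_accuracy`. Scoring only the masked positions keeps the unmasked symbols, which are trivially correct, from inflating the metric.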
### Conclusion

The 2-encoder pretraining model proposed in this study performs well in both molecular property prediction and odor characteristic prediction, with a particular advantage in handling complex molecular structure information. The model provides a new tool for future molecular property prediction and drug discovery.