Noisemol: A Noise-Robusted Data Augmentation Via Perturbing Noise for Molecular Property Prediction
Jing Jiang,Ruisheng Zhang,Yongna Yuan,Tongfeng Li,Gaili Li,Zhili Zhao,Zhixuan Yu
DOI: https://doi.org/10.1016/j.jmgm.2023.108454
IF: 2.942
2023-01-01
Journal of Molecular Graphics and Modelling
Abstract:Simplified Molecular-Input Line-Entry System (SMILES) is one of a widely used molecular representation methods for molecular property prediction. We conjecture that all the characters in the SMILES string of a molecule are essential for making up the molecules, but most of them make little contribution to determining a particular property of the molecule. Therefore, we verified the conjecture in the pre-experiment. Motivated by the result, we propose to inject proper noisy information into the SMILES to augment the training data by increasing the diversity of the labeled molecules. To this end, we explore injecting perturbing noise into the original labeled SMILES strings to construct augmented data for alleviating the limitation of the labeled compound data and enhancing the model to extract more useful molecular representation for molecular property prediction. Specifically, we directly adopt mask, swap, deletion, and fusion operations on SMILES strings to randomly mask, swap, and delete atoms in SMILES strings. Then, the augmented data is used by two strategies: each epoch alternately feeds the original and perturbing noisy molecules, or each batch alternately feeds the original and perturbing noisy molecules. We conduct experiments on both Transformer and BiGRU models to validate the effectiveness by adopting widely used datasets from MoleculeNet and ZINC. Experimental results demonstrate that the proposed method outperforms strong baselines on all the datasets. NoiseMol obtains the best performance on BBBP and FDA when compared with state-of-the-art methods. Besides, NoiseMol achieves the best accuracy on LogP. Therefore, injecting perturbing noise into the labeled SMILES strings is an effective and efficient method, which improves the prediction performance, generalization, and robustness of the deep learning models.