Abstract: Generative deep learning models have emerged as a powerful approach for de novo drug design, as they aid researchers in finding new molecules with desired properties. Despite continuous improvements in the field, a subset of the outputs that sequence-based de novo generators produce cannot be progressed due to errors. Here, we propose to fix these invalid outputs post hoc. Transformer models from the field of natural language processing have been shown to be very effective in similar tasks. We therefore trained a transformer model to translate invalid Simplified Molecular-Input Line-Entry System (SMILES) strings into valid representations. The performance of this SMILES corrector was evaluated on four representative methods of de novo generation: a recurrent neural network (RNN), a target-directed RNN, a generative adversarial network (GAN), and a variational autoencoder (VAE). The percentage of invalid outputs from these specific generative models ranges between 4 and 89%, with different models having different error-type distributions. Post hoc correction of SMILES was shown to increase model validity. The SMILES corrector trained with one error per input alters 60–90% of invalid generator outputs and fixes 35–80% of them. However, transformer models trained with multiple errors per input achieved better error detection and overall performance. In this case, the best model was able to correct 60–95% of invalid generator outputs. Further analysis showed that these fixed molecules are comparable to the correct molecules from the de novo generators in terms of novelty and similarity. Additionally, the SMILES corrector can be used to expand the number of interesting new molecules within the targeted chemical space. Introducing different errors into existing molecules yields novel analogs with a uniqueness of 39% and a novelty of approximately 20%. The results of this research demonstrate that SMILES correction is a viable post hoc extension and can enhance the search for better drug candidates.
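
The idea of training a corrector on inputs with one or several errors can be illustrated with a small sketch. It is not the authors' pipeline: the edit operations (delete, insert, substitute), the `ALPHABET`, and the names `corrupt_smiles`, `make_pairs`, and `has_syntax_error` are all assumptions made for illustration. The syntax check is deliberately crude and only catches unbalanced parentheses, unbalanced brackets, and unpaired ring-closure digits. A real workflow would more likely rely on a cheminformatics toolkit to decide validity.

```python
import random

# Hypothetical character set for insertions and substitutions (an assumption,
# not the vocabulary used in the paper).
ALPHABET = list("CNOSFclnos()=#123")


def corrupt_smiles(smiles, n_errors, rng):
    """Apply n_errors random character-level edits to a SMILES string."""
    chars = list(smiles)
    for _ in range(n_errors):
        op = rng.choice(("delete", "insert", "substitute"))
        if op == "delete" and len(chars) > 1:
            del chars[rng.randrange(len(chars))]
        elif op == "insert":
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(ALPHABET))
        else:
            # Substitution (also the fallback when deletion would empty the string).
            # It may pick the same character, leaving the string unchanged.
            chars[rng.randrange(len(chars))] = rng.choice(ALPHABET)
    return "".join(chars)


def make_pairs(smiles_list, n_errors=1, seed=0):
    """Build (corrupted, original) pairs, i.e. source and target for a translation model."""
    rng = random.Random(seed)
    return [(corrupt_smiles(s, n_errors, rng), s) for s in smiles_list]


def has_syntax_error(smiles):
    """Crude check: unbalanced (), unbalanced [], or unpaired ring-closure digits."""
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            if in_bracket:
                return True
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return True
            in_bracket = False
        elif in_bracket:
            continue  # digits inside brackets are isotopes, charges, or H counts
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return True
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first occurrence opens, second closes
    return depth != 0 or in_bracket or bool(open_rings)


if __name__ == "__main__":
    for corrupted, original in make_pairs(["c1ccccc1", "CC(=O)O"], n_errors=2):
        print(original, "->", corrupted, "| syntax error:", has_syntax_error(corrupted))
```

Random edits do not always produce an invalid string, so pairs whose corrupted side still parses would need to be filtered out before training. Chemical errors such as impossible valences are outside what this check can detect.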

Levenshtein Augmentation Improves Performance of SMILES Based Deep-Learning Synthesis Prediction

SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules

Augmentation Is What You Need!

Learning to SMILES: BAN-based strategies to improve latent representation learning from molecules

A Novel Molecular Representation Learning for Molecular Property Prediction with a Multiple SMILES-Based Augmentation

Improving the reliability of molecular string representations for generative chemistry

State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis

Faster and more diverse de novo molecular optimization with double-loop reinforcement learning using augmented SMILES

CONSMI: Contrastive Learning in the Simplified Molecular Input Line Entry System Helps Generate Better Molecules

SDA: Improving Text Generation with Self Data Augmentation

Synergy Effect between Convolutional Neural Networks and the Multiplicity of SMILES for Improvement of Molecular Prediction

Evaluating the Impact of Data Augmentation on Predictive Model Performance

Stepping Back to SMILES Transformers for Fast Molecular Representation Inference

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

Leveraging Large Language Models for Code-Mixed Data Augmentation in Sentiment Analysis

Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction

Augmented Memory: Sample-Efficient Generative Molecular Design with Reinforcement Learning

3D2SMILES: Translating Physical Molecular Models into Digital DeepSMILES Notations Using Deep Learning

Infusing Linguistic Knowledge of SMILES into Chemical Language Models

UnCorrupt SMILES: a novel approach to de novo design