Improving the reliability of molecular string representations for generative chemistry

Etienne Reboul, Zoe Wefers, Jerome Waldispuhl, Antoine Taly

DOI: https://doi.org/10.1101/2024.10.07.617002

2024-10-11

Abstract: Generative chemistry has developed rapidly in recent years. However, models based on string representations of molecules still rely largely on SMILES and SELFIES, which were not developed for this context. The goal of this study is first to analyze the difficulties encountered by a small generative model when using SMILES and SELFIES. We found that SELFIES and canonical SMILES are not fully reliable representations, i.e. they do not ensure both the viability and the fidelity of samples. Viable samples represent novel, unique molecules with correct valence, while fidelity means that samples accurately reproduce the chemical properties of the training set. In fact, 20% of the samples generated using canonical SMILES as the input representation do not correspond to valid molecules. In contrast, samples generated using SELFIES reproduce the chemical properties of the training dataset less faithfully. To mitigate these problems, we developed data augmentation procedures for both SELFIES and SMILES. Simplifying the complex syntax of SELFIES yielded only marginal improvements in stability and overall fidelity to the training set. For SMILES, we developed a stochastic data augmentation procedure called ClearSMILES, which reduces the vocabulary size needed to represent a SMILES dataset, explicitly represents aromaticity via Kekule SMILES, and reduces the effort required by deep learning models to process SMILES. ClearSMILES reduced the error rate in samples by an order of magnitude, from 20% to 2.2%, and improved the fidelity of samples to the training set.
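To make the vocabulary-size claim concrete, the sketch below counts the distinct tokens needed for a tiny hand-written dataset in aromatic and in Kekule form. It is only an illustration under stated assumptions: the regex tokenizer, the function names, and the three example molecules (toluene, pyridine, acetanilide) are hypothetical and are not taken from the paper, and the actual ClearSMILES procedure is not described in this abstract.

```python
import re

# Simplified SMILES tokenizer (an assumption for illustration, not the paper's):
# bracket atoms, two-letter halogens, %NN ring closures, then any single character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)

def vocabulary(dataset):
    """Return the set of distinct tokens used across a list of SMILES strings."""
    return {tok for smi in dataset for tok in tokenize(smi)}

# Toluene, pyridine, and acetanilide, written by hand in both styles.
aromatic = ["Cc1ccccc1", "c1ccncc1", "CC(=O)Nc1ccccc1"]
kekule = ["CC1=CC=CC=C1", "C1=CC=NC=C1", "CC(=O)NC1=CC=CC=C1"]

vocab_aromatic = vocabulary(aromatic)
vocab_kekule = vocabulary(kekule)

print(len(vocab_aromatic), sorted(vocab_aromatic))
print(len(vocab_kekule), sorted(vocab_kekule))
# Tokens that only the aromatic form needs (lowercase aromatic atoms).
print(sorted(vocab_aromatic - vocab_kekule))
```

In this toy dataset the Kekule form drops the lowercase aromatic atom tokens because the `=` bond token is already needed elsewhere. The trade-off is longer strings, so the vocabulary saving should be weighed against sequence length on a real dataset.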

Bioinformatics

What problem does this paper attempt to address?

The paper addresses the reliability of molecular string representations, such as SMILES and SELFIES, when generative chemistry models use them to produce novel compounds. Specifically, the study identified the following issues with existing SMILES and SELFIES representations:

1. **Validity issue of SMILES**: When canonical SMILES is used as the input representation, 20% of the generated samples do not correspond to valid molecules.
2. **Fidelity issue of SELFIES**: Although SELFIES ensures the validity of generated molecules, the samples reproduce the chemical properties of the training dataset less faithfully.

To mitigate these issues, the researchers developed data augmentation methods for both SMILES and SELFIES. For SMILES, they proposed ClearSMILES, which simplifies the SMILES syntax to reduce the vocabulary size and explicitly represents aromaticity, thereby improving the validity and fidelity of the generated samples. For SELFIES, they attempted to simplify its complex hexadecimal encoding to improve stability. These improvements aim to make generative models more reliable when generating new compounds.
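The error rates quoted above are fractions of generated strings that fail a validity check. The sketch below shows how such a fraction could be computed with a cheap syntactic screen (balanced parentheses and paired ring-closure digits). This is a deliberately weak proxy and an assumption for illustration: the paper's notion of validity also requires correct valence, which needs a cheminformatics toolkit, and the helper names and sample strings here are hypothetical.

```python
def has_balanced_syntax(smiles):
    """Syntactic screen only: parentheses balance and ring-closure digits pair up.

    Digits inside bracket atoms (isotopes, charges, H counts) are ignored.
    %NN ring closures are treated digit by digit, which is an approximation.
    This does NOT check valence or chemical validity.
    """
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing parenthesis with no matching opener
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first occurrence opens, second closes
    return depth == 0 and not open_rings and not in_bracket

def error_rate(samples):
    """Fraction of samples that fail the syntactic screen."""
    return sum(not has_balanced_syntax(s) for s in samples) / len(samples)

# Hypothetical generated samples: two of the five are malformed.
samples = ["c1ccccc1", "CC(=O)O", "C1CCCC", "CC(C", "CCO"]
print(error_rate(samples))

# Reduction factor implied by the reported rates (20% to 2.2%).
reduction = 0.20 / 0.022
print(round(reduction, 1))
```

The reported drop from 20% to 2.2% is roughly a ninefold reduction, consistent with the "order of magnitude" wording.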

SELFIES and the future of molecular string representations

Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation

Domain-Agnostic Molecular Generation with Chemical Feedback

fragSMILES: a Chemical String Notation for Advanced Fragment and Chirality Representation

GP-MoLFormer: A Foundation Model For Molecular Generation

Invalid SMILES are beneficial rather than detrimental to chemical language models

Recent advances in the Self-Referencing Embedding Strings (SELFIES) library

PromptSMILES: Prompting for scaffold decoration and fragment linking in chemical language models

Fuzz testing molecular representation using deep variational anomaly generation

SAFE setup for generative molecular design

GEN: Highly Efficient SMILES Explorer Using Autodidactic Generative Examination Networks

SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules

Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models

Chemical Language Model Linker: blending text and molecules with modular adapters

t-SMILES: A Scalable Fragment-based Molecular Representation Framework for De Novo Molecule Generation

Levenshtein Augmentation Improves Performance of SMILES Based Deep-Learning Synthesis Prediction

IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System

CONSMI: Contrastive Learning in the Simplified Molecular Input Line Entry System Helps Generate Better Molecules

SELFormer: Molecular Representation Learning via SELFIES Language Models

3D2SMILES: Translating Physical Molecular Models into Digital DeepSMILES Notations Using Deep Learning