Improving the reliability of molecular string representations for generative chemistry

Etienne Reboul, Zoe Wefers, Jerome Waldispuhl, Antoine Taly
DOI: https://doi.org/10.1101/2024.10.07.617002
2024-10-11
Abstract: Generative chemistry has seen rapid development recently. However, models based on string representations of molecules still rely largely on SMILES and SELFIES, which were not developed for this context. The goal of this study is first to analyze the difficulties encountered by a small generative model when using SMILES and SELFIES. Our study found that SELFIES and canonical SMILES are not fully reliable representations, i.e. they do not ensure both the viability and the fidelity of samples. Viable samples represent novel, unique molecules with correct valence, while fidelity ensures the accurate reproduction of chemical properties from the training set. In fact, 20% of the samples generated using canonical SMILES as input representation do not correspond to valid molecules. In contrast, samples generated using SELFIES reproduce the chemical properties of the training dataset less faithfully. To mitigate these problems, we developed data augmentation procedures for both SELFIES and SMILES. Simplifying the complex syntax of SELFIES yielded only marginal improvements in stability and overall fidelity to the training set. For SMILES, we developed a stochastic data augmentation procedure called ClearSMILES, which reduces the vocabulary size needed to represent a SMILES dataset, explicitly represents aromaticity via Kekule SMILES, and reduces the effort required by deep learning models to process SMILES. ClearSMILES reduced the error rate in samples by an order of magnitude, from 20% to 2.2%, and improved the fidelity of samples to the training set.
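The abstract defines viable samples as novel, unique molecules with correct valence. As a rough illustration of how such counts could be tallied, here is a minimal Python sketch. It is not the authors' evaluation code: the function names, the report keys, and the toy syntax check are all assumptions made for this example. The check only looks at balanced parentheses and paired single-digit ring labels, it ignores bracket atoms and `%nn` labels, and it does not test valence, so a real pipeline would substitute a cheminformatics toolkit. Uniqueness and novelty are compared on raw strings, which assumes the samples and training set were already written in one consistent form.

```python
from typing import Callable, Dict, Iterable


def has_balanced_syntax(smiles: str) -> bool:
    """Toy stand-in for a validity check (hypothetical, syntax only).

    Checks balanced parentheses and paired single-digit ring-closure labels.
    It does NOT check valence and ignores bracket atoms and %nn ring labels.
    """
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            # First use of a digit opens a ring label, second use closes it.
            open_rings ^= {ch}
    return bool(smiles) and depth == 0 and not open_rings


def viability_report(
    samples: Iterable[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool] = has_balanced_syntax,
) -> Dict[str, float]:
    """Return validity, uniqueness, novelty and the overall viable fraction."""
    samples = list(samples)
    training = set(training_set)
    valid = [s for s in samples if is_valid(s)]
    unique = set(valid)            # string-level uniqueness (assumption)
    novel = unique - training      # not seen in the training set
    n = len(samples)
    return {
        "validity": len(valid) / n if n else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
        "viable_fraction": len(novel) / n if n else 0.0,
    }


if __name__ == "__main__":
    generated = ["CCO", "CCO", "C1CCCCC1", "C1CCCC", "CC(C"]
    print(viability_report(generated, training_set={"CCO"}))
```

Passing a different `is_valid` callable is the intended extension point: any function that takes a string and returns a boolean can replace the toy check.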
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the reliability of molecular string representations (such as SMILES and SELFIES) used in generative chemistry to produce novel compounds. Specifically, the study identifies two issues with existing SMILES and SELFIES representations:

1. **Validity issue of SMILES**: When canonical SMILES is used as the input representation, 20% of the generated samples do not correspond to valid molecules.
2. **Fidelity issue of SELFIES**: Although SELFIES ensures the validity of generated molecules, the generated samples reproduce the chemical properties of the training dataset less faithfully.

To mitigate these issues, the researchers developed data augmentation methods for both SMILES and SELFIES. For SMILES, they proposed ClearSMILES, a stochastic procedure that reduces the vocabulary size needed to represent a dataset and explicitly represents aromaticity via Kekule SMILES; it lowered the error rate from 20% to 2.2% and improved the fidelity of the generated samples. For SELFIES, they attempted to simplify its complex hexadecimal encoding, which yielded only marginal improvements in stability and fidelity. These changes aim to make generative models more reliable when generating new compounds.
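To make the idea of vocabulary size concrete, the following Python sketch tokenizes a tiny set of SMILES strings and counts the distinct tokens. It does not reproduce the ClearSMILES procedure. The regular expression is a simplified tokenizer assumed for this example, and the Kekule strings are written by hand. It only illustrates one effect mentioned in the abstract: writing aromatic rings in Kekule form removes the lowercase aromatic atom tokens from the vocabulary.

```python
import re
from typing import Iterable, List, Set

# Simplified SMILES tokenizer (an assumption for this sketch): bracket atoms,
# two-letter halogens, %nn ring labels, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens with the pattern above."""
    return TOKEN_PATTERN.findall(smiles)


def vocabulary(dataset: Iterable[str]) -> Set[str]:
    """Collect the distinct tokens used across a dataset."""
    return {token for smiles in dataset for token in tokenize(smiles)}


# Pyridine, piperidine and propene, with pyridine in aromatic notation.
aromatic_dataset = ["c1ccncc1", "C1CCNCC1", "C=CC"]
# Same molecules, with pyridine written by hand in Kekule form.
kekule_dataset = ["C1=CC=NC=C1", "C1CCNCC1", "C=CC"]

if __name__ == "__main__":
    print(sorted(vocabulary(aromatic_dataset)))  # ['1', '=', 'C', 'N', 'c', 'n']
    print(sorted(vocabulary(kekule_dataset)))    # ['1', '=', 'C', 'N']
```

In this toy dataset the aromatic notation needs six tokens and the Kekule notation four, because `c` and `n` are no longer required once double bonds are written explicitly.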