Invalid SMILES are beneficial rather than detrimental to chemical language models

Michael A. Skinnider
DOI: https://doi.org/10.1038/s42256-024-00821-x
IF: 23.8
2024-03-30
Nature Machine Intelligence
Abstract: Generative machine learning models have attracted intense interest for their ability to sample novel molecules with desired chemical or biological properties. Among these, language models trained on SMILES (Simplified Molecular-Input Line-Entry System) representations have been subject to the most extensive experimental validation and have been widely adopted. However, these models have what is perceived to be a major limitation: some fraction of the SMILES strings that they generate are invalid, meaning that they cannot be decoded to a chemical structure. This perceived shortcoming has motivated a remarkably broad spectrum of work designed to mitigate the generation of invalid SMILES or correct them post hoc. Here I provide causal evidence that the ability to produce invalid outputs is not harmful but is instead beneficial to chemical language models. I show that the generation of invalid outputs provides a self-corrective mechanism that filters low-likelihood samples from the language model output. Conversely, enforcing valid outputs produces structural biases in the generated molecules, impairing distribution learning and limiting generalization to unseen chemical space. Together, these results refute the prevailing assumption that invalid SMILES are a shortcoming of chemical language models and reframe them as a feature, not a bug.
computer science, artificial intelligence, interdisciplinary applications
What problem does this paper attempt to address?
The paper examines the generation of invalid SMILES (Simplified Molecular-Input Line-Entry System) strings by chemical language models and argues a counterintuitive position: invalid SMILES are beneficial rather than harmful. The conventional view treats invalid SMILES as a major flaw, because these strings cannot be decoded into chemical structures, and researchers have proposed many methods to reduce or correct them. The author instead presents experimental evidence that the ability to generate invalid SMILES is an advantage for chemical language models. The main contributions are:

1. **Invalid SMILES as a self-correcting mechanism**: Invalid SMILES tend to be assigned lower likelihoods by the model, so discarding them filters low-quality samples out of the output and improves overall performance.
2. **Causal evidence**: The author modified the rules of SELFIES (SELF-referencIng Embedded Strings, a text representation designed so that every string decodes to a valid chemical structure) to let the model generate some invalid SELFIES. Allowing invalid outputs improved the model's performance.
3. **Impact of structural bias**: Forcing the model to generate only valid outputs biases its exploration of chemical space, in particular toward aliphatic compounds and away from aromatic compounds. This bias impairs distribution learning and generalization to unseen chemical space.
4. **Role of invalid outputs in structure elucidation**: The author further shows that this ability matters for the structure elucidation of complex natural products, especially when experimental data are limited.
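The filtering idea in the first contribution can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: `is_plausibly_valid` and `split_by_validity` are hypothetical helpers, and the validity check is a crude syntactic proxy (balanced parentheses, closed brackets, paired ring-closure digits) rather than a real SMILES parser. In practice a cheminformatics toolkit would decide validity.

```python
from typing import List, Tuple

Sample = Tuple[str, float]  # (SMILES string, model log-likelihood)


def is_plausibly_valid(smiles: str) -> bool:
    """Crude syntactic proxy for SMILES validity (an assumption of this sketch).

    Checks only that parentheses balance, bracket atoms are closed, and every
    ring-closure digit outside brackets appears an even number of times.
    It does not check valence, aromaticity, or any other chemistry.
    """
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif in_bracket:
            continue  # digits inside brackets are isotopes, charges, H counts
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: the first use opens, the second closes
    return bool(smiles) and depth == 0 and not open_rings and not in_bracket


def split_by_validity(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Partition sampled strings into (valid, invalid) lists, keeping order."""
    valid = [s for s in samples if is_plausibly_valid(s[0])]
    invalid = [s for s in samples if not is_plausibly_valid(s[0])]
    return valid, invalid


def mean_log_likelihood(samples: List[Sample]) -> float:
    """Average log-likelihood of a non-empty list of samples."""
    return sum(ll for _, ll in samples) / len(samples)
```

Comparing `mean_log_likelihood` on the two lists returned by `split_by_validity` is one way to check, on a model's own samples, whether the invalid strings are indeed the lower-likelihood ones; keeping only the valid list then acts as the filter.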
In summary, the paper challenges the traditional view regarding invalid SMILES in chemical language models and provides experimental evidence that generating invalid outputs can actually help the model better explore chemical space and improve its overall performance.