Abstract: Generative machine learning models have attracted intense interest for their ability to sample novel molecules with desired chemical or biological properties. Among these, language models trained on SMILES (Simplified Molecular-Input Line-Entry System) representations have been subject to the most extensive experimental validation and have been widely adopted. However, these models have what is perceived to be a major limitation: some fraction of the SMILES strings that they generate are invalid, meaning that they cannot be decoded to a chemical structure. This perceived shortcoming has motivated a remarkably broad spectrum of work designed to mitigate the generation of invalid SMILES or correct them post hoc. Here I provide causal evidence that the ability to produce invalid outputs is not harmful but is instead beneficial to chemical language models. I show that the generation of invalid outputs provides a self-corrective mechanism that filters low-likelihood samples from the language model output. Conversely, enforcing valid outputs produces structural biases in the generated molecules, impairing distribution learning and limiting generalization to unseen chemical space. Together, these results refute the prevailing assumption that invalid SMILES are a shortcoming of chemical language models and reframe them as a feature, not a bug.

What problem does this paper attempt to address?

The paper examines the generation of invalid SMILES (Simplified Molecular-Input Line-Entry System) strings by chemical language models and argues a counterintuitive position: the ability to generate invalid SMILES benefits these models rather than harming them. The conventional view treats invalid SMILES as a major flaw, because such strings cannot be decoded into chemical structures, and it has motivated a range of methods that either reduce invalid outputs or correct them post hoc. The author instead presents experimental evidence that the ability to generate invalid SMILES is an advantage, not a drawback.

The main contributions of the paper are:

1. **Invalid SMILES as a self-correcting mechanism**: Invalid SMILES tend to be low-likelihood samples, so discarding them filters low-quality outputs and improves the overall quality of what the model generates.
2. **Causal evidence support**: SELFIES (SELF-referencIng Embedded Strings) is a text representation designed so that every string decodes to a valid chemical structure. When its rules are modified so that the model can generate some invalid SELFIES, model performance improves, which indicates that the ability to produce invalid outputs is itself the cause of the benefit.
3. **Impact of structural bias**: Forcing the model to generate only valid outputs introduces structural bias into the generated molecules, in particular a tendency to over-produce aliphatic compounds and under-produce aromatic ones. This bias impairs the model's ability to learn the training distribution and to generalize to unseen chemical space.
4. **Role of invalid outputs in structure elucidation**: The author further demonstrates the importance of this ability in the structure elucidation of complex natural products, especially when experimental data are limited.

In summary, the paper challenges the conventional view of invalid SMILES in chemical language models and provides experimental evidence that the ability to generate invalid outputs helps the model explore chemical space and improves its overall performance.
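
The self-correcting mechanism described in the first contribution can be illustrated with a short sketch. This is not the paper's code: the function names `is_syntactically_plausible` and `mean_loglik_by_validity` are hypothetical, and the validity check is a deliberately crude stand-in that only tests whether branches, bracket atoms, and ring-closure labels are balanced. A real pipeline would instead use a cheminformatics parser (for example, RDKit's `Chem.MolFromSmiles`, which returns `None` for strings it cannot parse). The sketch assumes each sample arrives as a `(smiles, log_likelihood)` pair, with the log-likelihood supplied by whatever language model generated it.

```python
from statistics import mean


def is_syntactically_plausible(smiles: str) -> bool:
    """Toy validity check (an assumption, not a real SMILES parser).

    Accepts a string only if parentheses are balanced, bracket atoms are
    closed, and every ring-closure digit outside brackets occurs in pairs.
    It does not check valence, aromaticity, or element symbols.
    """
    depth = 0            # current branch nesting depth
    open_rings = set()   # ring-closure digits opened but not yet closed
    in_bracket = False   # True while inside a bracket atom like [13CH4]
    for ch in smiles:
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif in_bracket:
            continue     # digits inside brackets are not ring closures
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}   # toggle: first use opens, second closes
    return bool(smiles) and depth == 0 and not open_rings and not in_bracket


def mean_loglik_by_validity(samples):
    """Split (smiles, log_likelihood) pairs by validity and average each group.

    Returns (mean_valid, mean_invalid); a group with no members yields None.
    """
    valid = [ll for s, ll in samples if is_syntactically_plausible(s)]
    invalid = [ll for s, ll in samples if not is_syntactically_plausible(s)]
    return (mean(valid) if valid else None,
            mean(invalid) if invalid else None)
```

Calling `mean_loglik_by_validity` on a batch of sampled strings gives one way to check, for a particular model, whether the invalid samples are indeed the lower-likelihood ones; keeping only the strings that pass the validity check then acts as the filter the summary describes.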

Invalid SMILES are beneficial rather than detrimental to chemical language models

Improving the reliability of molecular string representations for generative chemistry

UnCorrupt SMILES: a novel approach to de novo design

Chemical Language Model Linker: blending text and molecules with modular adapters

PromptSMILES: Prompting for scaffold decoration and fragment linking in chemical language models

Learning a Generative Model for Validity in Complex Discrete Structures

Infusing Linguistic Knowledge of SMILES into Chemical Language Models

GP-MoLFormer: A Foundation Model For Molecular Generation

GEN: Highly Efficient SMILES Explorer Using Autodidactic Generative Examination Networks

CONSMI: Contrastive Learning in the Simplified Molecular Input Line Entry System Helps Generate Better Molecules

Domain-Agnostic Molecular Generation with Chemical Feedback

SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules

Is BigSMILES the Friend of Polymer Machine Learning?

Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models

Keeping it Simple: Language Models can learn Complex Molecular Distributions

SAFE setup for generative molecular design

Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files

Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation