Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models

Alexius Wadell,Anoushka Bhutani,Venkatasubramanian Viswanathan

2024-09-19

Abstract: Molecular Foundation Models are emerging as powerful tools for accelerating molecular design, material science, and cheminformatics, leveraging transformer architectures to speed up the discovery of new materials and drugs while reducing the computational cost of traditional ab initio methods. However, current models are constrained by closed-vocabulary tokenizers that fail to capture the full diversity of molecular structures. In this work, we systematically evaluate thirteen chemistry-specific tokenizers for their coverage of the SMILES language, uncovering substantial gaps. Using N-gram language models, we assessed the impact of tokenizer choice on model performance and quantified the information loss of unknown tokens. We introduce two new tokenizers, *smirk* and *smirk-gpe*, which can represent the entirety of the OpenSMILES specification while avoiding the pitfalls of existing tokenizers. Our work highlights the importance of open-vocabulary modeling for molecular foundation models and the need for chemically diverse benchmarks for cheminformatics.

Machine Learning,Artificial Intelligence,Chemical Physics

What problem does this paper attempt to address?

The paper addresses the vocabulary limitations of the tokenizers used by molecular foundation models. Specifically:

1. **Problems with existing models**: Current molecular foundation models are constrained by closed-vocabulary tokenizers, which fail to capture the full diversity of molecular structures. Any structure outside the vocabulary is mapped to an unknown token, so its information is lost.
2. **Evaluation of existing tokenizers**: The paper systematically evaluates 13 chemistry-specific tokenizers for their coverage of the SMILES language and identifies substantial gaps. Using N-gram language models, the authors assess the impact of tokenizer choice on model performance and quantify the information lost to unknown tokens.
3. **Introduction of new tokenizers**: The paper introduces two new tokenizers, `smirk` and `smirk-gpe`, which can represent the entirety of the OpenSMILES specification while avoiding the shortcomings of existing tokenizers. The `smirk` tokenizer avoids vocabulary explosion by decomposing bracketed atoms into their constituent parts, while `smirk-gpe` further compresses the token sequences using a BPE-like method.
4. **Importance of open-vocabulary modeling**: The paper emphasizes the importance of open-vocabulary modeling for molecular foundation models and highlights the need for chemically diverse benchmark datasets for evaluating cheminformatics models.

In summary, the paper focuses on improving tokenizer design for molecular foundation models so that they can handle the diversity of molecular structures and lose less information to vocabulary limitations.
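The difference between a closed, atom-wise vocabulary and a decomposed one can be illustrated with a small sketch. This is not the `smirk` implementation or its vocabulary: the regular expressions, the function names (`atomwise_tokens`, `decomposed_tokens`, `encode`), and the tiny training set are all hypothetical and cover only a fragment of SMILES. It only shows why treating a whole bracketed atom as one token produces unknown tokens for unseen combinations, while splitting the bracket into smaller glyphs does not.

```python
import re

# Atom-wise split (assumption: simplified pattern): a bracketed atom such as
# [C@H] or [NH+] is kept as ONE token, so every new combination needs a new entry.
ATOMWISE = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#$:/\\\-+().~*]"
)

# Glyphs inside a bracket: digits, chirality marks, charge signs, element symbols.
# Illustrative only; this is not a full OpenSMILES grammar.
INSIDE = re.compile(r"\d|@@|@|[+-]|se|as|[A-Z][a-z]?|[a-z]|:")


def atomwise_tokens(smiles):
    """Split a SMILES string, keeping each bracketed atom whole."""
    return ATOMWISE.findall(smiles)


def decomposed_tokens(smiles):
    """Split a SMILES string, breaking bracketed atoms into glyphs."""
    out = []
    for tok in atomwise_tokens(smiles):
        if tok.startswith("["):
            out.append("[")
            out.extend(INSIDE.findall(tok[1:-1]))
            out.append("]")
        else:
            out.append(tok)
    return out


def encode(tokens, vocab):
    """Replace every out-of-vocabulary token with [UNK]."""
    return [t if t in vocab else "[UNK]" for t in tokens]


# Hypothetical toy "training set" used to build both vocabularies.
train = ["C[C@H](N)C(=O)O", "C[NH+](C)C"]
atom_vocab = {t for s in train for t in atomwise_tokens(s)}
glyph_vocab = {t for s in train for t in decomposed_tokens(s)}

# An unseen molecule whose bracketed atom [N@H+] combines glyphs seen separately.
test = "C[N@H+](CC)C(C)C"
atom_encoded = encode(atomwise_tokens(test), atom_vocab)
glyph_encoded = encode(decomposed_tokens(test), glyph_vocab)

print(atom_encoded)   # the bracketed atom becomes [UNK]
print(glyph_encoded)  # every glyph is in the vocabulary
```

In the atom-wise encoding the nitrogen center, its chirality, and its charge are all collapsed into a single `[UNK]`, which is the kind of information loss the summary describes. The decomposed encoding is longer, which is the cost that a BPE-like merge step such as the one attributed to `smirk-gpe` would aim to reduce.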

PrefixMol: Target- and Chemistry-aware Molecule Design Via Prefix Embedding

Improving the reliability of molecular string representations for generative chemistry

Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates

GP-MoLFormer: A Foundation Model For Molecular Generation

t-SMILES: a fragment-based molecular representation framework for de novo ligand design

t-SMILES: A Scalable Fragment-based Molecular Representation Framework for De Novo Molecule Generation

Token-Mol 1.0: Tokenized drug design with large language model

Exploring Data‐Driven Chemical SMILES Tokenization Approaches to Identify Key Protein‐Ligand Binding Moieties

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules

Infusing Linguistic Knowledge of SMILES into Chemical Language Models

3D2SMILES: Translating Physical Molecular Models into Digital DeepSMILES Notations Using Deep Learning

3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization

SELFIES and the future of molecular string representations

IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System

Stepping Back to SMILES Transformers for Fast Molecular Representation Inference

PromptSMILES: Prompting for scaffold decoration and fragment linking in chemical language models

Chemical Language Model Linker: blending text and molecules with modular adapters

Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation