Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models

Alexius Wadell, Anoushka Bhutani, Venkatasubramanian Viswanathan
2024-09-19
Abstract: Molecular Foundation Models are emerging as powerful tools for accelerating molecular design, material science, and cheminformatics, leveraging transformer architectures to speed up the discovery of new materials and drugs while reducing the computational cost of traditional ab initio methods. However, current models are constrained by closed-vocabulary tokenizers that fail to capture the full diversity of molecular structures. In this work, we systematically evaluate thirteen chemistry-specific tokenizers for their coverage of the SMILES language, uncovering substantial gaps. Using N-gram language models, we assessed the impact of tokenizer choice on model performance and quantified the information loss of unknown tokens. We introduce two new tokenizers, `smirk` and `smirk-gpe`, which can represent the entirety of the OpenSMILES specification while avoiding the pitfalls of existing tokenizers. Our work highlights the importance of open-vocabulary modeling for molecular foundation models and the need for chemically diverse benchmarks for cheminformatics.
Machine Learning, Artificial Intelligence, Chemical Physics
What problem does this paper attempt to address?
The paper addresses the vocabulary limitations of the tokenizers used by molecular foundation models. Specifically:

1. **Problems with existing models**: Current molecular foundation models are constrained by closed-vocabulary tokenizers, which fail to capture the full diversity of molecular structures. Any part of a molecule that falls outside the vocabulary is mapped to an unknown token, so the information it carried is lost.
2. **Evaluation of existing tokenizers**: The paper systematically evaluates 13 chemistry-specific tokenizers for their coverage of the SMILES language and identifies substantial gaps. Using N-gram language models, the authors assess the impact of tokenizer choice on model performance and quantify the information lost to unknown tokens.
3. **Introduction of new tokenizers**: The paper introduces two new tokenizers, `smirk` and `smirk-gpe`, which can represent the entirety of the OpenSMILES specification while avoiding the shortcomings of existing tokenizers. The `smirk` tokenizer avoids vocabulary explosion by decomposing bracketed atoms into their constituent components, while `smirk-gpe` further compresses the token sequences using a BPE-like method.
4. **Importance of open-vocabulary modeling**: The paper emphasizes the importance of open-vocabulary modeling for molecular foundation models and highlights the need for more chemically diverse benchmark datasets for evaluating cheminformatics models.

In summary, the paper focuses on improving tokenizer design for molecular foundation models so that they can handle the diversity of molecular structures without losing information to vocabulary limitations.
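The difference between a closed atom-wise vocabulary and a decomposed one can be illustrated with a minimal Python sketch. This is not the `smirk` implementation: the regular expressions, the function names (`atomwise_tokens`, `decomposed_tokens`, `encode`), and the two toy vocabularies are all assumptions made for illustration, and the inner pattern is deliberately simplified (it does not handle two-letter aromatic symbols such as `se`, for example).

```python
import re

# Outer pass (assumed pattern): a whole bracketed atom, the two-letter
# halogens Br and Cl, or otherwise any single character.
ATOM_LEVEL = re.compile(r"\[[^\]]*\]|Br|Cl|.")

# Inner pass (simplified assumption): an element-like symbol
# (capital letter plus optional lowercase letter) or any single character.
GLYPH = re.compile(r"[A-Z][a-z]?|.")


def atomwise_tokens(smiles):
    """Atom-wise split: each bracketed atom stays a single token."""
    return ATOM_LEVEL.findall(smiles)


def decomposed_tokens(smiles):
    """Split bracketed atoms further into brackets plus their inner glyphs."""
    tokens = []
    for tok in ATOM_LEVEL.findall(smiles):
        if tok.startswith("[") and tok.endswith("]") and len(tok) > 2:
            tokens.append("[")
            tokens.extend(GLYPH.findall(tok[1:-1]))
            tokens.append("]")
        else:
            tokens.append(tok)
    return tokens


def encode(tokens, vocab, unk="[UNK]"):
    """Replace every out-of-vocabulary token with the unknown marker."""
    return [t if t in vocab else unk for t in tokens]


# Hypothetical toy vocabularies, not taken from any real tokenizer.
CLOSED_VOCAB = {"C", "N", "O", "(", ")", "=", "[NH4+]"}
GLYPH_VOCAB = {"[", "]", "Fe", "Cu", "+", "2"}

# With the closed vocabulary, two different ions collapse to the same output.
iron_closed = encode(atomwise_tokens("[Fe+2]"), CLOSED_VOCAB)    # ['[UNK]']
copper_closed = encode(atomwise_tokens("[Cu+2]"), CLOSED_VOCAB)  # ['[UNK]']

# With decomposition, a handful of glyphs keeps them distinct.
iron_open = encode(decomposed_tokens("[Fe+2]"), GLYPH_VOCAB)
copper_open = encode(decomposed_tokens("[Cu+2]"), GLYPH_VOCAB)
# iron_open   == ['[', 'Fe', '+', '2', ']']
# copper_open == ['[', 'Cu', '+', '2', ']']
```

In this toy setup the atom-wise vocabulary would need one entry per distinct bracketed atom, whereas the decomposed vocabulary only needs entries for the glyphs that can appear inside brackets. The cost is longer token sequences, which is the overhead a BPE-like merge step such as the one described for `smirk-gpe` is meant to reduce.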