A Large Encoder-Decoder Family of Foundation Models For Chemical Language

Eduardo Soares, Victor Shirasuna, Emilio Vital Brazil, Renato Cerqueira, Dmitry Zubarev, Kristin Schmidt
2024-07-25
Abstract: Large-scale pre-training methodologies for chemical language models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Typically, this involves pre-training on unlabeled data followed by fine-tuning on specific tasks, which reduces dependence on annotated datasets and broadens the understanding of chemical language representations. This paper introduces a family of large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. The proposed foundation models support a range of complex tasks, including quantum property prediction, and offer flexibility through two main variants (289M and $8\times289M$). Our experiments across multiple benchmark datasets validate the capacity of the proposed models to provide state-of-the-art results on different tasks. We also provide a preliminary assessment of the compositionality of the embedding space as a prerequisite for reasoning tasks. We demonstrate that the produced latent space is separable compared to the state-of-the-art, with few-shot learning capabilities.
Machine Learning, Artificial Intelligence, Chemical Physics
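To make the abstract's "SMILES samples" and "molecular tokens" more concrete, here is a minimal sketch of regex-based SMILES tokenization, followed by the average sequence length implied by the two figures quoted in the abstract. The pattern is a simplified version of the atom-level regex commonly used for SMILES. It is an assumption for illustration and not necessarily the tokenizer the authors used. The name `tokenize_smiles` is hypothetical.

```python
import re

# Simplified atom-level SMILES pattern (an illustrative assumption, not the
# paper's tokenizer). Bracket atoms and two-letter halogens are tried first so
# that "Cl" is not split into "C" + "l".
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%[0-9]{2}|[BCNOSPFIbcnosp]|[0-9]"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$)"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; fail if any character is not covered."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
salt_tokens = tokenize_smiles("[Na+].[Cl-]")

# Average tokens per molecule implied by the abstract's rounded figures
# (4 billion tokens over 91 million SMILES samples).
avg_tokens_per_molecule = 4_000_000_000 / 91_000_000

print(len(aspirin_tokens), salt_tokens, round(avg_tokens_per_molecule))
```

Under these assumptions, the quoted figures work out to roughly 44 tokens per molecule. Since "4 billion" is itself a rounded number, this is only an order-of-magnitude check.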
What problem does this paper attempt to address?
The paper addresses molecular property prediction and molecule generation in chemistry. Its main points are:

1. **Proposing a Novel Foundation Model**: The paper introduces SMI-TED289M, a family of large-scale encoder-decoder foundation models designed for chemical language and pre-trained on a large, curated dataset.
2. **Problems Addressed**:
   - Accelerating discovery in fields such as drug development and materials science by predicting molecular properties, which reduces the cost and time of traditional experimental methods.
   - Enhancing the capabilities of cheminformatics with large-scale pre-training, in particular self-supervised learning of context-aware input representations from unlabeled datasets.
   - Reducing the reliance on annotated datasets and broadening the understanding of chemical language representations.
3. **Datasets and Model Structure**:
   - Pre-training was conducted on 91 million SMILES samples from the PubChem database (equivalent to 4 billion molecular tokens).
   - The model has two main variants: the base version with 289 million parameters, and the mixture-of-experts version (SMI-TED8x289M), which consists of 8 such base models and totals 2.272 billion parameters.
4. **Experimental Results**:
   - Experiments on multiple benchmark datasets show state-of-the-art performance on various tasks, including quantum property prediction and physical property prediction.
   - A comparison between frozen-weight and fine-tuned models shows that fine-tuning further improves performance.
   - The model's decoding ability was evaluated on the MOSES benchmark dataset, where it demonstrated superior performance.
   - The mixture-of-experts version (SMI-TED8x289M) performs better in molecular property prediction tasks.
   - The latent space of the model was studied and shown to be compositional, which supports chemical reasoning tasks.

In summary, the paper develops a foundation model for molecular property prediction in chemistry, and extensive experiments validate the effectiveness of the proposed model.
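The summary describes the mixture-of-experts variant only as a combination of 8 base models. As a rough illustration of how such a combination can work in general, the sketch below implements generic top-k gating: a gate scores each expert, the k highest-scoring experts are kept, and their outputs are averaged with softmax weights. This is an assumption-laden sketch of the general technique, not the paper's architecture. The function names, the choice of `k=2`, and the toy expert outputs are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(gate_scores, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_scores[i] for i in chosen])
    return dict(zip(chosen, weights))


def combine_experts(expert_outputs, gate_scores, k=2):
    """Weighted sum of the selected experts' output vectors."""
    gate = top_k_gate(gate_scores, k)
    mixed = [0.0] * len(expert_outputs[0])
    for index, weight in gate.items():
        for d, value in enumerate(expert_outputs[index]):
            mixed[d] += weight * value
    return mixed


# Toy example: 8 experts, each emitting a 2-dimensional output (hypothetical
# values). Experts 2 and 5 receive the highest, equal gate scores.
expert_outputs = [[float(i), 2.0 * i] for i in range(8)]
gate_scores = [0.0, 0.0, 5.0, 0.0, 0.0, 5.0, 0.0, 0.0]
mixed = combine_experts(expert_outputs, gate_scores, k=2)
print(mixed)
```

In this toy case only two of the eight experts contribute to the result, which is the usual motivation for sparse gating: total capacity grows with the number of experts, while each input is routed to only a few of them.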