Abstract: Large-scale pre-training methodologies for chemical language models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Typically, this involves pre-training on unlabeled data followed by fine-tuning on specific tasks, which reduces dependence on annotated datasets and broadens the understanding of chemical language representations. This paper introduces a family of large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. The proposed foundation model supports a range of complex tasks, including quantum property prediction, and offers flexibility through two main variants (289M and $8\times289M$). Our experiments across multiple benchmark datasets validate the capacity of the proposed model to provide state-of-the-art results on different tasks. We also provide a preliminary assessment of the compositionality of the embedding space as a prerequisite for reasoning tasks. We demonstrate that the produced latent space is separable compared with the state of the art and exhibits few-shot learning capabilities.

What problem does this paper attempt to address?

The paper aims to address molecular property prediction and molecular generation in chemistry. Specifically:

1. **Proposing a Novel Foundation Model**: The paper introduces a family of large-scale encoder-decoder foundation models for chemical language. The base model, SMI-TED289M, is pre-trained on a large, carefully curated dataset.
2. **Problems Addressed**:
   - Accelerating discovery in fields such as drug development and materials science by predicting molecular properties, which reduces the cost and time of traditional experimental methods.
   - Enhancing cheminformatics through large-scale pre-training, in particular self-supervised learning of context-aware input representations from unlabeled datasets.
   - Reducing reliance on annotated datasets and expanding the understanding of chemical language representations.
3. **Datasets and Model Structure**:
   - Pre-training was conducted on 91 million SMILES samples from the PubChem database (equivalent to 4 billion molecular tokens).
   - The model has two main variants: the base version with 289 million parameters, and the mixture-of-experts version (SMI-TED8x289M), which consists of 8 such base models and totals 2.272 billion parameters.
4. **Experimental Results**:
   - Experiments on multiple benchmark datasets show state-of-the-art performance on various tasks, including quantum property prediction and physical property prediction.
   - A comparison between frozen-weight and fine-tuned models shows that fine-tuning further improves performance.
   - The model's decoding ability was evaluated on the MOSES benchmark dataset, where it outperformed the compared methods.
   - The mixture-of-experts version (SMI-TED8x289M) performs better on molecular property prediction tasks.
   - The latent space of the model was studied and shown to be composable, which supports chemical reasoning tasks.

In summary, the paper is primarily dedicated to developing an efficient foundation model for molecular property prediction in chemistry, and extensive experiments validate the effectiveness of the proposed model.
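To make the notion of "molecular tokens" more concrete, the following is a minimal sketch of how a SMILES string might be split into tokens. The regular expression and the function name `tokenize_smiles` are assumptions chosen for illustration; the tokenizer actually used for SMI-TED is not described in this text and may differ. The last lines simply divide the two corpus figures quoted above to get a rough average length per molecule.

```python
import re

# Hypothetical atom-level SMILES token pattern (an assumption, not the
# paper's tokenizer): bracket atoms, two-letter halogens, organic-subset
# atoms, aromatic atoms, ring-closure labels, and bond/branch symbols.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%[0-9]{2}|[0-9]|[()=#\-+\\/:~@?>*$.])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # The tokens must reassemble the input exactly; otherwise some
    # character was not covered by the pattern.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not fully tokenize SMILES: {smiles!r}")
    return tokens


aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
salt_tokens = tokenize_smiles("[Na+].[Cl-]")

# Rough average implied by the quoted corpus size:
# 4 billion tokens over 91 million SMILES samples.
avg_tokens_per_molecule = 4_000_000_000 / 91_000_000
```

Under this illustrative scheme, the quoted corpus figures work out to roughly 44 tokens per molecule on average, although the exact number depends on the tokenizer the authors used.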

A Large Encoder-Decoder Family of Foundation Models For Chemical Language
