Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Zhiyi Li, Gabriela Moisescu-Pareja, Oleksandr Dymov, Samuel Maddrell-Mander, Callum McLean, Frederik Wenkel, Luis Müller, Jama Hussein Mohamud, Ali Parviz, Michael Craig, Michał Koziarski, Jiarui Lu, Zhaocheng Zhu, Cristian Gabellini, Kerstin Klaser, Josef Dean, Cas Wognum, Maciej Sypetkowski, Guillaume Rabusseau, Reihaneh Rabbany, Jian Tang, Christopher Morris, Ioannis Koutis, Mirco Ravanelli, Guy Wolf, Prudencio Tossou, Hadrien Mary, Therence Bois, Andrew Fitzgibbon, Błażej Banaszewski, Chad Martin, Dominic Masters

Abstract: Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library, which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point for multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets improves when the model is also trained on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.

What problem does this paper attempt to address?

The paper addresses the small dataset sizes and sparse labels that limit the development of foundational models in molecular machine learning. It makes three contributions:

1. **Constructing Large-Scale Multi-Task Datasets**: The authors present seven new datasets grouped by size into three categories: ToyMix, LargeMix, and UltraLarge. Together they cover nearly 100 million molecules and over 3,000 sparsely defined tasks, for a total of more than 13 billion individual labels of quantum and biological properties. By the authors' count, this is 300 times more data points than OGB-LSC PCQM4Mv2 and 13 times more than the quantum-only QM1B.
2. **Developing the Graphium Library**: To support the development of foundational models based on these datasets, the authors provide the Graphium graph machine learning library. It simplifies building and training molecular machine learning models on multi-task and multi-level molecular datasets.
3. **Baseline Results**: The authors report results for a range of baseline models on these datasets, trained in both single-dataset and multi-dataset settings. Performance on low-resource biological datasets improves when the model is also trained on large amounts of quantum data. This suggests that multi-task and multi-level training of a foundational model, followed by fine-tuning on resource-constrained downstream tasks, may be a promising direction.

Through these contributions, the study advances molecular modeling, particularly model generalization and data efficiency with large-scale supervised training data. Graphium also makes it easier for researchers to apply graph neural network techniques to large-scale molecular datasets, a step towards foundational graph neural networks for chemistry.
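
The datasets are described as covering sparsely defined tasks, meaning that most molecules carry labels for only some tasks. One common way to train on such data is to skip missing labels when computing the loss. The sketch below illustrates that idea in plain Python; it is not Graphium's API or the authors' implementation, and the function name `masked_mse` and the use of NaN to mark a missing label are assumptions made for this example.

```python
import math


def masked_mse(predictions, labels):
    """Mean squared error over the labels that are present.

    `labels` uses NaN to mark a (molecule, task) pair with no measurement.
    This convention is an assumption of the sketch, not taken from the paper.
    """
    total, count = 0.0, 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if math.isnan(label):
                continue  # missing label: contributes nothing to the loss
            total += (pred - label) ** 2
            count += 1
    # With no labels at all, return 0.0 rather than dividing by zero.
    return total / count if count else 0.0


# Three molecules, two tasks; NaN marks a missing label.
labels = [[1.0, math.nan], [math.nan, 0.0], [2.0, 1.0]]
predictions = [[1.5, 9.0], [9.0, 0.0], [2.0, 0.0]]
loss = masked_mse(predictions, labels)
```

Predictions at positions with missing labels (the `9.0` values here) do not affect the result, so a model can be trained jointly on tasks that were measured on different subsets of molecules.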

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
