Abstract: In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, scaling laws in molecular pretraining models remain unexplored. In this work, we present Uni-Mol2, an innovative molecular pretraining model that leverages a two-track transformer to effectively integrate features at the atomic level, graph level, and geometry structure level. Along with this, we systematically investigate the scaling law within molecular pretraining models, characterizing the power-law correlations between validation loss and model size, dataset size, and computational resources. Consequently, we successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date. Extensive experiments show consistent improvement in the downstream tasks as the model size grows. Uni-Mol2 with 1.1B parameters also outperforms existing methods, achieving an average 27% improvement on the QM9 dataset and 14% on the COMPAS-1D dataset.

What problem does this paper attempt to address?

The paper addresses a research gap in the scalability of molecular pretraining models, particularly for applications in biomedicine and materials science. Specifically, it introduces Uni-Mol2, a large-scale molecular pretraining model designed to improve molecular representation learning by integrating features at the atomic level, graph level, and geometric structure level. Using a two-track transformer architecture, Uni-Mol2 is designed to handle the complexity and diversity of molecules. The key contributions of the paper include:

1. **Large-scale dataset construction**: The authors constructed a dataset containing approximately 884 million 3D conformations, described as the largest dataset for molecular pretraining to date, providing a foundation for training large molecular models.
2. **Exploration of the scalability of molecular pretraining models**: The paper systematically studies how validation loss depends on the number of Uni-Mol2 model parameters, dataset size, and computational resources, revealing a power-law correlation between validation loss and each of these factors. The summary describes this as the first demonstration of scaling laws in the field of molecular pretraining.
3. **Model scale**: Through pretraining, Uni-Mol2 has been scaled to 1.1 billion parameters, making it the largest molecular pretraining model to date.
4. **Improved performance on downstream tasks**: Experiments show that performance on downstream tasks (such as chemical property prediction on the QM9 and COMPAS-1D datasets) keeps improving as the number of model parameters increases, with the 1.1 billion parameter model achieving an average improvement of 27% on the QM9 dataset and 14% on the COMPAS-1D dataset.
5. **Performance with limited data**: The paper also explores model performance when data is limited, showing that larger models can still yield better predictive performance even when training data is scarce.

In summary, by proposing the Uni-Mol2 model, the paper addresses the lack of research on the scalability of molecular pretraining models and demonstrates the significant impact of model scaling on molecular representation learning and downstream task performance, providing a powerful tool for research in biomedicine and materials science.
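To make the phrase "power-law correlation" concrete, the sketch below fits a curve of the form L(x) = a * x^(-b) to a handful of (model size, validation loss) points by linear least squares in log-log space. This is only an illustration of the general technique: the functional form, the function name `fit_power_law`, and the data points are assumptions made here, not values or code from the paper, whose exact fitting procedure is not described in this text.

```python
import math


def fit_power_law(xs, losses):
    """Fit loss ~= a * x**(-b) by least squares on (log x, log loss).

    Returns (a, b). Taking logs turns the power law into a straight line:
    log(loss) = log(a) - b * log(x).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Ordinary least-squares slope and intercept of the log-log line.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, made-up data generated from a known power law (a=2.0, b=0.1).
# These are NOT measurements from Uni-Mol2.
model_sizes = [1e7, 1e8, 1e9]
val_losses = [2.0 * n ** -0.1 for n in model_sizes]

a, b = fit_power_law(model_sizes, val_losses)

# Extrapolate the fitted curve to a larger, hypothetical model size.
predicted_loss = a * (1e10) ** (-b)
print(f"a={a:.4f}, b={b:.4f}, predicted loss at 1e10 params={predicted_loss:.4f}")
```

The same function can be applied with dataset size or compute on the x-axis. On real measurements the points will not lie exactly on a line in log-log space, so the fitted exponent should be read together with the residuals rather than taken at face value.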

Uni-Mol2: Exploring Molecular Pretraining Model at Scale
