Uni-Mol2: Exploring Molecular Pretraining Model at Scale

Xiaohong Ji, Zhen Wang, Zhifeng Gao, Hang Zheng, Linfeng Zhang, Guolin Ke, Weinan E
2024-07-01
Abstract: In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, scaling laws in molecular pretraining models remain unexplored. In this work, we present Uni-Mol2, an innovative molecular pretraining model that leverages a two-track transformer to effectively integrate features at the atomic level, graph level, and geometry structure level. Along with this, we systematically investigate the scaling law within molecular pretraining models, characterizing the power-law correlations between validation loss and model size, dataset size, and computational resources. Consequently, we successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date. Extensive experiments show consistent improvement in the downstream tasks as the model size grows. Uni-Mol2 with 1.1B parameters also outperforms existing methods, achieving an average improvement of 27% on the QM9 dataset and 14% on the COMPAS-1D dataset.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses a research gap in the scalability of molecular pretraining models, particularly for applications in biomedicine and materials science. It introduces Uni-Mol2, a large-scale molecular pretraining model designed to improve molecular representation learning by integrating features at the atomic level, graph level, and geometric structure level. Its two-track transformer architecture is intended to handle the complexity and diversity of molecules. The key contributions of the paper include:

1. **Large-scale dataset construction**: The authors constructed a dataset of approximately 884 million 3D conformations, described as the largest dataset for molecular pretraining to date, which provides a foundation for training large molecular models.
2. **Exploration of the scalability of molecular pretraining models**: The paper systematically studies how validation loss depends on the number of Uni-Mol2 model parameters, dataset size, and computational resources, and finds a power-law correlation between validation loss and each of these factors. The paper presents this as the first demonstration of scaling laws in molecular pretraining.
3. **Model scale**: Uni-Mol2 has been scaled to 1.1 billion parameters, making it the largest molecular pretraining model to date.
4. **Improved performance on downstream tasks**: Experiments show that performance on downstream tasks (such as chemical property prediction on the QM9 and COMPAS-1D datasets) improves consistently as the number of model parameters grows, with the 1.1 billion parameter model achieving an average improvement of 27% on the QM9 dataset and 14% on the COMPAS-1D dataset.
5. **Performance with limited data**: The paper also examines the model under limited-data conditions and shows that larger models still deliver better predictive performance when training data is scarce.

In summary, by proposing Uni-Mol2, the paper fills a gap in research on the scalability of molecular pretraining models and demonstrates the impact of model scaling on molecular representation learning and downstream task performance, providing a tool for research in biomedicine and materials science.
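The power-law relationship between validation loss and model size mentioned above can be illustrated with a small fit. The following is a minimal sketch, not the authors' procedure: it assumes the simple form L(N) = a * N^(-alpha) with no irreducible-loss offset, and the data points, the constants, and the names `fit_power_law` and `predict_loss` are hypothetical, chosen only for illustration.

```python
import numpy as np


def fit_power_law(x, loss):
    """Fit loss ~ a * x**(-alpha) by least squares in log-log space.

    Assumes the simple two-parameter form with no constant offset.
    """
    x = np.asarray(x, dtype=float)
    loss = np.asarray(loss, dtype=float)
    # log(loss) = log(a) - alpha * log(x) is a straight line in log-log space.
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(x, a, alpha):
    """Evaluate the fitted power law at new values of x."""
    return a * np.asarray(x, dtype=float) ** (-alpha)


# Synthetic, made-up data (NOT results from the paper): losses generated
# exactly from a known power law so the fit can be checked.
model_sizes = np.array([1e8, 2e8, 4e8, 8e8])
true_a, true_alpha = 5.0, 0.1
losses = true_a * model_sizes ** (-true_alpha)

a, alpha = fit_power_law(model_sizes, losses)

# Extrapolate to a larger hypothetical model size.
extrapolated = predict_loss(1.1e9, a, alpha)
```

The same function could be applied with dataset size or compute on the x-axis instead of parameter count. On real training curves the points will not lie exactly on a line in log-log space, so the fitted exponent should be read together with the residuals rather than taken as exact.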