Abstract: Accurate property prediction is crucial for accelerating the discovery of new molecules. Although deep learning models have achieved remarkable success, their performance often relies on large amounts of labeled data that are expensive and time-consuming to obtain. Thus, there is a growing need for models that can perform well with limited experimentally-validated data. In this work, we introduce MoleVers, a versatile pretrained model designed for various types of molecular property prediction in the wild, i.e., where experimentally-validated molecular property labels are scarce. MoleVers adopts a two-stage pretraining strategy. In the first stage, the model learns molecular representations from large unlabeled datasets via masked atom prediction and dynamic denoising, a novel task enabled by a new branching encoder architecture. In the second stage, MoleVers is further pretrained using auxiliary labels obtained with inexpensive computational methods, enabling supervised learning without the need for costly experimental data. This two-stage framework allows MoleVers to learn representations that generalize effectively across various downstream datasets. We evaluate MoleVers on a new benchmark comprising 22 molecular datasets with diverse types of properties, the majority of which contain 50 or fewer training labels, reflecting real-world conditions. MoleVers achieves state-of-the-art results on 20 out of the 22 datasets, and ranks second on the remaining two, highlighting its ability to bridge the gap between data-hungry models and real-world conditions where practically-useful labels are scarce.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the challenges of molecular property prediction in real-world scenarios, where experimentally validated data is scarce. Although deep learning models have achieved significant success in molecular property prediction, they typically rely on large amounts of labeled data, which is both expensive and time-consuming to obtain. There is therefore a need for models that perform well with limited experimentally validated data.

Specifically, the authors propose **MoleVers**, a pre-trained model for various types of molecular property prediction. MoleVers employs a two-stage pre-training strategy to improve its generalization on downstream tasks:

1. **First-stage pre-training**: Learning molecular representations from large-scale unlabeled datasets through Masked Atom Prediction (MAP) and Dynamic Denoising.
2. **Second-stage pre-training**: Further pre-training using auxiliary labels generated by inexpensive computational methods (such as Density Functional Theory, DFT), enabling supervised learning without the need for costly experimental data.

### Main Contributions

1. **Two-stage pre-training framework**: Includes a novel dynamic denoising pre-training method that learns molecular representations without increasing the demand for downstream labels.
2. **Branching encoder architecture**: Decouples the Masked Atom Prediction and denoising pipelines, allowing the model to handle larger noise scales and thus improve generalization.
3. **MPPW Benchmark**: A new benchmark comprising 22 small datasets that reflect the data scarcity of real-world scenarios.

### Experimental Results

- On the MPPW benchmark, MoleVers achieved the best performance on 20 out of 22 datasets and ranked second on the remaining two.
- On the MoleculeNet benchmark, MoleVers also outperformed other baseline models on large datasets.

These results indicate that the two-stage pre-training strategy of MoleVers can significantly improve molecular property prediction in data-scarce situations.
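
To make the first-stage objectives more concrete, the following is a minimal NumPy sketch of how a single unlabeled molecule might be corrupted for Masked Atom Prediction and denoising. It is not the authors' implementation. The function name `corrupt_molecule`, the `MASK_TOKEN` id, the 15% mask ratio, and the uniform sampling of a per-molecule noise scale (one possible reading of "dynamic" denoising) are all assumptions made for illustration. Keeping the masked atom types and the noisy coordinates as separate inputs loosely mirrors the idea of decoupling the two pipelines.

```python
import numpy as np

MASK_TOKEN = 0  # hypothetical id reserved for the masked atom type


def corrupt_molecule(atom_types, coords, rng, mask_ratio=0.15, max_noise_scale=1.0):
    """Build one illustrative stage-1 training example from an unlabeled molecule.

    atom_types: (n_atoms,) integer atom-type ids (>= 1; 0 is reserved for the mask).
    coords:     (n_atoms, 3) 3D coordinates.
    """
    atom_types = np.asarray(atom_types)
    coords = np.asarray(coords, dtype=float)
    n_atoms = atom_types.shape[0]

    # Masked atom prediction: hide a random subset of atoms (at least one).
    n_masked = max(1, int(round(mask_ratio * n_atoms)))
    masked_idx = rng.choice(n_atoms, size=n_masked, replace=False)
    masked_types = atom_types.copy()
    masked_types[masked_idx] = MASK_TOKEN

    # Denoising: draw a noise scale for this molecule (an assumed reading of
    # "dynamic"), then perturb every coordinate with Gaussian noise at that scale.
    noise_scale = rng.uniform(0.0, max_noise_scale)
    noise = rng.normal(0.0, 1.0, size=coords.shape) * noise_scale
    noisy_coords = coords + noise

    return {
        "masked_types": masked_types,            # input for the masked-atom pipeline
        "noisy_coords": noisy_coords,            # input for the denoising pipeline
        "masked_idx": masked_idx,                # positions whose types are predicted
        "type_targets": atom_types[masked_idx],  # targets for masked atom prediction
        "noise_target": noise,                   # target for the denoising task
        "noise_scale": noise_scale,
    }


rng = np.random.default_rng(0)
atom_types = np.array([6, 6, 8, 1, 1, 1, 1, 1, 1])  # atomic numbers of ethanol
coords = rng.normal(size=(9, 3))                    # placeholder coordinates, not a real geometry
example = corrupt_molecule(atom_types, coords, rng)
```

A model trained on such examples would predict `type_targets` from `masked_types` and `noise_target` from `noisy_coords`. How the two losses are weighted and how the noise scale is actually sampled are details the summary above does not specify.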

Two-Stage Pretraining for Molecular Property Prediction in the Wild

Understanding the Limitations of Deep Models for Molecular Property Prediction: Insights and Solutions.

Fast and Effective Molecular Property Prediction with Transferability Map

MolPROP: Molecular Property prediction with multimodal language and graph fusion

Mole-BERT: Rethinking Pre-training Graph Neural Networks for Molecules

KnoMol: A Knowledge-Enhanced Graph Transformer for Molecular Property Prediction

MvMRL: a multi-view molecular representation learning method for molecular property prediction

Advanced deep learning methods for molecular property prediction

Dual-view Molecular Pre-training

3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information

MolCloze - A Unified Cloze-style Self-supervised Molecular Structure Learning Model for Chemical Property Prediction.

A merged molecular representation learning for molecular properties prediction with a web-based service

Scalable Multi-Task Transfer Learning for Molecular Property Prediction

Impact of Domain Knowledge and Multi-Modality on Intelligent Molecular Property Prediction: A Systematic Survey

DIG-Mol: A Contrastive Dual-Interaction Graph Neural Network for Molecular Property Prediction

Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction

MolKD: Distilling Cross-Modal Knowledge in Chemical Reactions for Molecular Property Prediction

Molecular Descriptors Property Prediction Using Transformer-Based Approach

Improving Molecular Properties Prediction Through Latent Space Fusion

Analyzing Learned Molecular Representations for Property Prediction