Two-Stage Pretraining for Molecular Property Prediction in the Wild

Kevin Tirta Wijaya, Minghao Guo, Michael Sun, Hans-Peter Seidel, Wojciech Matusik, Vahid Babaei
2024-11-06
Abstract: Accurate property prediction is crucial for accelerating the discovery of new molecules. Although deep learning models have achieved remarkable success, their performance often relies on large amounts of labeled data that are expensive and time-consuming to obtain. Thus, there is a growing need for models that can perform well with limited experimentally-validated data. In this work, we introduce MoleVers, a versatile pretrained model designed for various types of molecular property prediction in the wild, i.e., where experimentally-validated molecular property labels are scarce. MoleVers adopts a two-stage pretraining strategy. In the first stage, the model learns molecular representations from large unlabeled datasets via masked atom prediction and dynamic denoising, a novel task enabled by a new branching encoder architecture. In the second stage, MoleVers is further pretrained using auxiliary labels obtained with inexpensive computational methods, enabling supervised learning without the need for costly experimental data. This two-stage framework allows MoleVers to learn representations that generalize effectively across various downstream datasets. We evaluate MoleVers on a new benchmark comprising 22 molecular datasets with diverse types of properties, the majority of which contain 50 or fewer training labels reflecting real-world conditions. MoleVers achieves state-of-the-art results on 20 out of the 22 datasets, and ranks second among the remaining two, highlighting its ability to bridge the gap between data-hungry models and real-world conditions where practically-useful labels are scarce.
Machine Learning, Artificial Intelligence, Chemical Physics, Biomolecules
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses molecular property prediction in real-world settings, where experimentally validated labels are scarce. Deep learning models have achieved significant success on this task, but they typically rely on large amounts of labeled data, which are expensive and time-consuming to obtain. Models that perform well with limited experimental data are therefore needed.

To this end, the authors propose **MoleVers**, a pre-trained model for various types of molecular property prediction. MoleVers employs a two-stage pre-training strategy to improve generalization on downstream tasks:

1. **First-stage pre-training**: Learning molecular representations from large-scale unlabeled datasets through Masked Atom Prediction (MAP) and Dynamic Denoising.
2. **Second-stage pre-training**: Further pre-training using auxiliary labels generated by inexpensive computational methods (such as Density Functional Theory, DFT), enabling supervised learning without the need for costly experimental data.

### Main Contributions

1. **Two-stage pre-training framework**: Includes a novel dynamic denoising pre-training method that learns molecular representations without increasing the demand for downstream labels.
2. **Branching encoder architecture**: Decouples the Masked Atom Prediction and denoising pipelines, allowing the model to handle larger noise scales and thus improve generalization ability.
3. **MPPW Benchmark**: A new benchmark comprising 22 small datasets that reflect the data scarcity of real-world scenarios.

### Experimental Results

- On the MPPW benchmark, MoleVers achieved the best performance on 20 out of 22 datasets and ranked second on the remaining two.
- On the MoleculeNet benchmark, MoleVers also outperformed the other baseline models on large datasets.

These results indicate that the two-stage pre-training strategy of MoleVers improves molecular property prediction when labeled data are scarce.
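
To make the first-stage objectives more concrete, the sketch below shows how a single unlabeled molecule could be corrupted to produce training targets for masked atom prediction and coordinate denoising. It is an illustration only, not the authors' implementation. The function `corrupt_molecule`, the `MASK_TOKEN` id, the 15% mask ratio, and the candidate noise scales are all hypothetical. In particular, drawing a fresh noise scale for each example is only one plausible reading of "dynamic" denoising, since neither the abstract nor the summary defines the term. The branching encoder that would consume these inputs is not sketched.

```python
import random

MASK_TOKEN = 0  # hypothetical id reserved for masked atoms


def corrupt_molecule(atom_types, coords, mask_ratio=0.15,
                     noise_scales=(0.1, 0.5, 1.0), rng=None):
    """Build one self-supervised example from an unlabeled molecule.

    atom_types: list of positive integer atom ids (e.g. atomic numbers).
    coords: list of (x, y, z) tuples, one per atom.
    Returns the corrupted inputs and the targets for each task.
    """
    rng = rng or random.Random()
    n = len(atom_types)

    # Masked atom prediction: hide the type of a random subset of atoms.
    n_masked = max(1, round(mask_ratio * n))
    masked_idx = sorted(rng.sample(range(n), n_masked))
    masked_types = list(atom_types)
    for i in masked_idx:
        masked_types[i] = MASK_TOKEN
    map_targets = {i: atom_types[i] for i in masked_idx}

    # Denoising: add Gaussian noise to the coordinates. The scale is drawn
    # per example here (an assumption about what "dynamic" means).
    sigma = rng.choice(noise_scales)
    noise = [tuple(rng.gauss(0.0, sigma) for _ in range(3)) for _ in range(n)]
    noisy_coords = [tuple(c + e for c, e in zip(xyz, eps))
                    for xyz, eps in zip(coords, noise)]

    return {
        "masked_types": masked_types,   # input to the atom-prediction task
        "map_targets": map_targets,     # index -> original atom type
        "noisy_coords": noisy_coords,   # input to the denoising task
        "noise_targets": noise,         # noise the model would predict
        "sigma": sigma,
    }


# Usage: a water-like toy molecule (O, H, H) with made-up coordinates.
atoms = [8, 1, 1]
xyz = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
example = corrupt_molecule(atoms, xyz, rng=random.Random(0))
```

Both tasks take their targets from the molecule itself, which is why this stage needs no property labels. A model would be trained to recover `map_targets` from `masked_types` and `noise_targets` from `noisy_coords`.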