Abstract: Deep learning methods have been considered promising for accelerating molecular screening in drug discovery and material design. Due to the limited availability of labelled data, various self-supervised molecular pre-training methods have been presented. While many existing methods utilize common pre-training tasks in computer vision (CV) and natural language processing (NLP), they often overlook the fundamental physical principles governing molecules. In contrast, applying denoising in pre-training can be interpreted as an equivalent force learning, but the limited noise distribution introduces bias into the molecular distribution. To address this issue, we introduce a molecular pre-training framework called fractional denoising (Frad), which decouples noise design from the constraints imposed by force learning equivalence. In this way, the noise becomes customizable, allowing for incorporating chemical priors to significantly improve molecular distribution modeling. Experiments demonstrate that our framework consistently outperforms existing methods, establishing state-of-the-art results across force prediction, quantum chemical properties, and binding affinity tasks. The refined noise design enhances force accuracy and sampling coverage, which contribute to the creation of physically consistent molecular representations, ultimately leading to superior predictive performance.

What problem does this paper attempt to address?

The paper primarily aims to address several key issues in Molecular Property Prediction (MPP):

1. **The issue of limited labeled data**: In fields such as drug discovery and materials design, traditional first-principles computational methods and wet lab experiments are costly. Therefore, using deep learning methods to accelerate molecular screening has become a promising approach. However, one of the challenges these deep learning methods face is the very limited labeled molecular data available for training.
2. **Enhancing the effectiveness of self-supervised pre-training methods**: To alleviate the problem of insufficient labeled data, researchers have proposed various self-supervised molecular pre-training methods to leverage the intrinsic information in unlabeled molecular data. Most existing methods draw on techniques from computer vision (CV) and natural language processing (NLP), such as contrastive learning and masking. However, directly applying pre-training tasks from the CV and NLP fields to molecular data may overlook the fundamental chemical properties and physical principles of molecules, leading to suboptimal representation performance.
3. **Learning molecular representations with physical consistency**: Some recent denoising methods approximate learning atomic forces by learning noise on molecular conformations, thereby providing a physically interpretable pre-training task. However, the types of noise used in these methods are limited, resulting in biased molecular distribution modeling, which in turn affects the performance of downstream tasks.

To address the above issues, the paper proposes a novel molecular pre-training framework called "Fractional Denoising" (Frad). The core contribution of Frad is that it can maintain physical consistency while allowing the introduction of chemical prior knowledge to optimize molecular distribution modeling. Specifically, Frad adds mixed noise (including Chemical-Aware Noise (CAN) and Coordinate Gaussian Noise (CGN)) to equilibrium molecular conformations and then trains the model to predict the CGN part from the noisy conformations. This method not only better simulates the true distribution of molecules but also improves the accuracy of atomic forces, thereby enhancing the performance of downstream tasks.

In summary, the main goal of this paper is to develop a pre-training framework that can effectively utilize unlabeled data and consider the physical and chemical properties of molecules to improve the accuracy and generalization ability of molecular property prediction.
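The two-stage noising described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the summary does not say what form the chemical-aware noise takes, so the sketch assumes a single torsion perturbation (rotating some atoms about one bond) as a stand-in. The function names, the `bond` and `moving_atoms` arguments, and the values of `torsion_std` and `cgn_std` are all hypothetical. The network itself is omitted; only the training target (the CGN part) and a mean-squared-error loss are shown.

```python
import numpy as np


def rotate_about_axis(points, origin, axis, angle):
    """Rotate points about the line through `origin` along `axis` (Rodrigues' formula)."""
    k = axis / np.linalg.norm(axis)
    p = points - origin
    rotated = (
        p * np.cos(angle)
        + np.cross(k, p) * np.sin(angle)
        + np.outer(p @ k, k) * (1.0 - np.cos(angle))
    )
    return rotated + origin


def frad_style_sample(x_eq, bond, moving_atoms, rng, torsion_std=0.5, cgn_std=0.04):
    """Return (noisy conformation, CGN target) for one equilibrium conformation.

    x_eq:         (n_atoms, 3) equilibrium coordinates
    bond:         (i, j) atom indices defining the rotation axis (assumed input)
    moving_atoms: indices of atoms rotated about that bond (assumed input)
    """
    i, j = bond
    # Step 1: chemical-aware noise. Assumption: one random torsion rotation.
    x_can = x_eq.copy()
    angle = rng.normal(0.0, torsion_std)
    x_can[moving_atoms] = rotate_about_axis(
        x_eq[moving_atoms], x_eq[j], x_eq[j] - x_eq[i], angle
    )
    # Step 2: coordinate Gaussian noise added on top of the CAN-perturbed structure.
    cgn = rng.normal(0.0, cgn_std, size=x_eq.shape)
    # The model sees x_can + cgn and is trained to predict only the cgn part.
    return x_can + cgn, cgn


def denoising_loss(predicted_noise, cgn):
    """Mean squared error between the model's output and the CGN target."""
    return float(np.mean((predicted_noise - cgn) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy four-atom chain; coordinates are made up for illustration.
    x_eq = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.4, 0.0], [3.5, 1.4, 0.5]])
    x_noisy, target = frad_style_sample(x_eq, bond=(1, 2), moving_atoms=[3], rng=rng)
    print(x_noisy.shape, target.shape)
```

The point of the split is that only the Gaussian part is used as the regression target, so the first noise stage can be swapped for any chemically motivated perturbation without changing the loss.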

Pre-training with Fractional Denoising to Enhance Molecular Property Prediction

Fractional Denoising for 3D Molecular Pre-training

Sliced Denoising: A Physics-Informed Molecular Pre-Training Method

3D Denoisers are Good 2D Teachers: Molecular Pretraining via Denoising and Cross-Modal Distillation

Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models

Denoising Drug Discovery Data for Improved ADMET Property Prediction

Denoise Pretraining on Nonequilibrium Molecules for Accurate and Transferable Neural Potentials

Supervised Pretraining for Molecular Force Fields and Properties Prediction

Generalizing Denoising to Non-Equilibrium Structures Improves Equivariant Force Fields

Improving Molecular Pretraining with Complementary Featurizations

Pre-training Protein Models with Molecular Dynamics Simulations for Drug Binding

Learning data efficient coarse-grained molecular dynamics from forces and noise

May the Force be with You: Unified Force-Centric Pre-Training for 3D Molecular Conformations

Unlocking Potential Binders: Multimodal Pretraining DEL-Fusion for Denoising DNA-Encoded Libraries

Automated 3D Pre-Training for Molecular Property Prediction

Fragment-based Pretraining and Finetuning on Molecular Graphs

On Data Imbalance in Molecular Property Prediction with Pre-training

Physics-augmented Deep Learning with Adversarial Domain Adaptation: Applications to STM Image Denoising

Quantum-Informed Molecular Representation Learning Enhancing ADMET Property Prediction

Dual-view Molecular Pre-training