MoleMCL: a multi-level contrastive learning framework for molecular pre-training

Xinyi Zhang, Yanni Xu, Changzhi Jiang, Lian Shen, Xiangrong Liu
DOI: https://doi.org/10.1093/bioinformatics/btae164
IF: 5.8
2024-03-26
Bioinformatics
Abstract:
Motivation: Molecular representation learning plays an indispensable role in crucial tasks such as property prediction and drug design. Despite the notable achievements of Molecular Pre-training Models (MPMs), current methods often fail to capture both the structural and feature semantics of molecular graphs. Moreover, while graph contrastive learning has unveiled new prospects, existing augmentation techniques often struggle to retain their core semantics. To overcome these limitations, we propose a gradient-compensated encoder parameter perturbation approach, ensuring efficient and stable feature augmentation. By merging enhancement strategies grounded in attribute masking and parameter perturbation, we introduce MoleMCL, a new MOLEcular pre-training model based on Multi-level Contrastive Learning.
Results: Experimental results demonstrate that MoleMCL adeptly dissects the structure and feature semantics of molecular graphs, surpassing current state-of-the-art models in molecular prediction tasks, paving a novel avenue for molecular modeling.
Availability and implementation: The code and data underlying this work are available in GitHub at https://github.com/BioSequenceAnalysis/MoleMCL.
Contact: 23020221154148@stu.xmu.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.
biochemical research methods,biotechnology & applied microbiology,mathematical & computational biology
What problem does this paper attempt to address?
The paper aims to address two major issues in molecular representation learning:

1. **Insufficient capture of structural and feature semantics**: Existing molecular pre-training models often struggle to capture the structural and feature semantics of molecular graphs at the same time. This limitation hurts performance in critical tasks such as property prediction and drug design.
2. **Core semantic retention challenge in contrastive learning**: Although contrastive learning offers a new perspective for molecular pre-training, existing data augmentation techniques often struggle to produce useful augmented views while retaining the core semantics of molecules. For example, structure-based augmentation methods may fail to handle molecular activity cliffs correctly, where structurally similar molecules have drastically different properties.

To address these issues, the paper proposes MoleMCL, a molecular pre-training framework based on multi-level contrastive learning. Specifically, MoleMCL includes the following innovations:

- **Gradient-compensated parameter perturbation strategy**: To perturb model parameters in a more principled way and keep contrastive learning stable and effective, the paper introduces a gradient compensation scheme. This method uses gradients from previous contrastive learning stages to guide parameter perturbation, thereby avoiding the semantic disruption that simple Gaussian noise perturbation might cause.
- **Augmentation strategy combining attribute masking and parameter perturbation**: To capture both the structural and feature semantics of molecular graphs, the paper combines attribute masking with parameter perturbation in the MoleMCL framework. This integrated approach enables the model to learn molecular representations at different levels.
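
The two augmentations and the contrastive objective described above can be illustrated with a small, framework-free sketch. This summary does not give the paper's exact update rule, so everything here is an assumption made for illustration: the function names, the `sigma` and `eta` parameters, the additive form "Gaussian noise plus a multiple of the previous gradient", and the use of an InfoNCE-style loss are hypothetical and are not taken from the MoleMCL code.

```python
import math
import random


def mask_attributes(node_features, mask_rate, mask_token, rng):
    """Attribute masking: replace the features of a random subset of atoms."""
    n = len(node_features)
    n_mask = max(1, int(round(mask_rate * n)))
    masked_idx = set(rng.sample(range(n), n_mask))
    masked = [mask_token if i in masked_idx else f
              for i, f in enumerate(node_features)]
    return masked, masked_idx


def perturb_parameters(params, prev_grads, sigma, eta, rng):
    """Parameter perturbation with a gradient term (ASSUMED additive form).

    Each parameter receives Gaussian noise scaled by sigma plus eta times
    the gradient saved from an earlier contrastive step.
    """
    return [p + sigma * rng.gauss(0.0, 1.0) + eta * g
            for p, g in zip(params, prev_grads)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] pairs with anchors[i]; the other
    entries of positives act as negatives for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


if __name__ == "__main__":
    rng = random.Random(0)
    atoms = ["C", "C", "O", "N"]  # toy atom-type features
    masked_atoms, idx = mask_attributes(atoms, 0.25, "[MASK]", rng)
    print(masked_atoms, sorted(idx))

    weights = [0.5, -0.2, 0.1]       # toy encoder parameters
    saved_grads = [0.1, 0.0, -0.3]   # toy gradients from an earlier step
    print(perturb_parameters(weights, saved_grads, 0.01, 0.5, rng))
```

In this sketch, `sigma=0` and `eta=0` leave the parameters unchanged, and `eta=0` alone reduces to plain Gaussian perturbation, which is the baseline the gradient term is meant to improve on. In a real pipeline the original and perturbed encoders would each embed the same batch of molecules, and `info_nce` would compare the two sets of embeddings.
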
Experimental results show that MoleMCL outperforms current state-of-the-art models on molecular property prediction tasks. It can serve as a foundational pre-training task for various graph neural network architectures, enhancing their performance. Additionally, molecular retrieval experiments further validate that MoleMCL generates chemically meaningful representations, demonstrating the effectiveness of the proposed multi-level contrastive learning method.
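
A retrieval experiment of the kind mentioned above typically ranks a library of molecules by how similar their embeddings are to a query embedding. The sketch below assumes cosine similarity as the ranking score and uses made-up names and two-dimensional vectors; the summary states neither the metric nor the data used in the paper, so none of this reflects the authors' setup.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_embedding, library, k=3):
    """Return the k library entries most similar to the query,
    as (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_embedding, emb))
              for name, emb in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical embeddings; a real library would come from a pre-trained encoder.
    toy_library = {
        "mol_a": [1.0, 0.0],
        "mol_b": [0.9, 0.1],
        "mol_c": [0.0, 1.0],
    }
    for name, score in retrieve([1.0, 0.0], toy_library, k=2):
        print(name, round(score, 3))
```

If the nearest neighbors returned for a query share functional groups or scaffolds with it, that is informal evidence that the embedding space encodes chemically meaningful structure.
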