Abstract: Property prediction on molecular graphs is an important application of Graph Neural Networks (GNNs). Recently, unlabeled molecular data has become abundant, which facilitates the rapid development of self-supervised learning for GNNs in the chemical domain. In this work, we propose pretraining GNNs at the fragment level, a promising middle ground to overcome the limitations of node-level and graph-level pretraining. Borrowing techniques from recent work on principal subgraph mining, we obtain a compact vocabulary of prevalent fragments from a large pretraining dataset. From the extracted vocabulary, we introduce several fragment-based contrastive and predictive pretraining tasks. The contrastive learning task jointly pretrains two different GNNs: one on molecular graphs and the other on fragment graphs, which represent higher-order connectivity within molecules. By enforcing consistency between the fragment embedding and the aggregated embedding of the corresponding atoms from the molecular graphs, we ensure that the embeddings capture structural information at multiple resolutions. The structural information of fragment graphs is further exploited to extract auxiliary labels for graph-level predictive pretraining. We employ both the pretrained molecular-based and fragment-based GNNs for downstream prediction, thus utilizing the fragment information during finetuning. Our graph fragment-based pretraining (GraphFP) improves performance on 5 out of 8 common molecular benchmarks and improves performance on long-range biological benchmarks by at least 11.5%. Code is available at: https://github.com/lvkd84/GraphFP
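
The consistency objective described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: mean pooling as the aggregation, cosine similarity, the temperature value, and the InfoNCE-style form of the loss are all assumptions made for illustration, and the function names are hypothetical. The embeddings would normally come from the two GNNs; here they are plain lists.

```python
import math

def mean_pool(atom_embeddings, atom_to_fragment, num_fragments):
    """Average the atom embeddings that belong to each fragment.

    Assumes every fragment contains at least one atom.
    """
    dim = len(atom_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_fragments)]
    counts = [0] * num_fragments
    for emb, frag in zip(atom_embeddings, atom_to_fragment):
        counts[frag] += 1
        for k, value in enumerate(emb):
            sums[frag][k] += value
    return [[s / counts[f] for s in sums[f]] for f in range(num_fragments)]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_loss(pooled, fragment_embeddings, temperature=0.1):
    """InfoNCE-style loss (an assumed form): pooled[i] should be closest
    to fragment_embeddings[i]; all other fragments act as negatives."""
    total = 0.0
    for i, anchor in enumerate(pooled):
        logits = [cosine(anchor, f) / temperature for f in fragment_embeddings]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(pooled)

# Toy data: three atoms, the first two in fragment 0, the last in fragment 1.
atom_emb = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
pooled = mean_pool(atom_emb, [0, 0, 1], num_fragments=2)
aligned = contrastive_loss(pooled, [[1.0, 0.1], [0.1, 1.0]])
swapped = contrastive_loss(pooled, [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)
```

In this toy setup the loss is lower when each fragment embedding matches the pooled embedding of its own atoms than when the fragment embeddings are swapped, which is the behaviour a consistency objective of this kind is meant to reward.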

What problem does this paper attempt to address?

The paper addresses the challenges of applying Graph Neural Networks (GNNs) to property prediction on molecular graphs, in particular how to use large-scale unlabeled molecular data to improve GNN performance. It proposes a framework called GraphFP, which uses fragment-level pretraining to overcome the limitations of the node-level and graph-level pretraining used in existing methods.

### Main Issues Addressed

1. **Data Hunger Problem**: Although modern GNNs can efficiently represent chemical structures and achieve state-of-the-art performance in many tasks, these models often require a large amount of labeled data to avoid overfitting. In the chemical domain, however, available labeled molecular datasets are usually small, which limits the generalization ability of GNNs.
2. **Effective Design of Self-Supervised Learning**: With the emergence of large amounts of unlabeled molecular data, researchers have begun to explore self-supervised learning techniques to leverage this data. How to design self-supervised pretraining tasks suited to the chemical domain remains an open problem.
3. **Fragment-Level Representation Learning**: Existing pretraining methods focus either on the node level or on the graph level, and both have limitations. Node-level methods may capture only local patterns while ignoring higher-order structural arrangements; graph-level methods may overlook finer-grained details. Fragment-based pretraining is proposed as a direction that overcomes both limitations.

### Solutions

- **Proposing the GraphFP Framework**: This is a contrastive pretraining framework that works at the fragment level to capture both fine-grained and global patterns. For each molecule, GraphFP obtains two representations, a molecular graph and a fragment graph, each processed by a different GNN.
- **Fragment Extraction and Representation**: To generate a vocabulary of molecular fragments, the paper uses a principal subgraph mining algorithm to extract prevalent fragments from a large pretraining dataset. This yields a compact and diverse fragment vocabulary in which each fragment appears frequently enough without sacrificing fragment size.
- **Contrastive and Predictive Pretraining Tasks**: The paper defines several pretraining tasks, including a contrastive learning task and fragment-based predictive tasks. These tasks encourage the model to learn the structural information of molecular fragments at multiple resolutions.
- **Model Integration for Downstream Tasks**: GraphFP uses both the molecular encoder and the fragment encoder obtained from pretraining to make predictions on downstream tasks, so the fragment information is also used during finetuning.

With these methods, GraphFP improves performance on 5 out of 8 common molecular benchmarks and improves performance on long-range biological benchmarks by at least 11.5%.
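
To make the "two representations" idea concrete, the sketch below derives a fragment graph from a molecular graph. It is a minimal illustration under stated assumptions, not the paper's procedure: it assumes the atoms have already been partitioned into disjoint fragments (for example by a mined vocabulary), and it assumes two fragments are connected whenever a bond joins an atom in one to an atom in the other. The function name `build_fragment_graph` is hypothetical.

```python
def build_fragment_graph(bonds, atom_to_fragment):
    """Return the sorted list of undirected edges between fragments.

    bonds: iterable of (atom_i, atom_j) index pairs from the molecular graph.
    atom_to_fragment: list mapping each atom index to its fragment index
        (assumed to be a disjoint partition of the atoms).
    An edge is added when a bond crosses a fragment boundary (assumed rule).
    """
    edges = set()
    for a, b in bonds:
        fa, fb = atom_to_fragment[a], atom_to_fragment[b]
        if fa != fb:
            # Store each undirected edge once, smaller index first.
            edges.add((min(fa, fb), max(fa, fb)))
    return sorted(edges)

# Toy chain of six atoms split into three hypothetical two-atom fragments.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
atom_to_fragment = [0, 0, 1, 1, 2, 2]
print(build_fragment_graph(bonds, atom_to_fragment))  # [(0, 1), (1, 2)]
```

The six-atom chain collapses to a three-node path, which shows how a fragment graph can expose connectivity between larger substructures that a node-level view only reaches after several message-passing steps.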

Fragment-based Pretraining and Finetuning on Molecular Graphs

Molecular Representation Contrastive Learning Via Transformer Embedding to Graph Neural Networks

Towards Effective and Generalizable Fine-tuning for Pre-trained Molecular Graph Models

Enhancing molecular property prediction with auxiliary learning and task-specific adaptation

A knowledge-guided pre-training framework for improving molecular representation learning

Motif-based Graph Self-Supervised Learning for Molecular Property Prediction

Leveraging 2D molecular graph pretraining for improved 3D conformer generation with graph neural networks

Investigating Graph Neural Networks and Classical Feature-Extraction Techniques in Activity-Cliff and Molecular Property Prediction

Learn molecular representations from large-scale unlabeled molecules for drug discovery

Describe Molecules by a Heterogeneous Graph Neural Network with Transformer-like Attention for Supervised Property Predictions

KPGT: Knowledge-Guided Pre-training of Graph Transformer for Molecular Property Prediction

FragNet: A Graph Neural Network for Molecular Property Prediction with Four Layers of Interpretability

CasANGCL: pre-training and fine-tuning model based on cascaded attention network and graph contrastive learning for molecular property prediction

An effective self-supervised framework for learning expressive molecular global representations to drug discovery

Dual-view Molecular Pre-training

FraGAT: a Fragment-Oriented Multi-Scale Graph Attention Model for Molecular Property Prediction.

Strategies for Pre-training Graph Neural Networks

Learning to Pre-train Graph Neural Networks.

Graph Neural Tree: A novel and interpretable deep learning-based framework for accurate molecular property predictions

HiGNN: Hierarchical Informative Graph Neural Networks for Molecular Property Prediction Equipped with Feature-Wise Attention