Abstract: Property prediction on molecular graphs is an important application of Graph Neural Networks (GNNs). Recently, unlabeled molecular data has become abundant, which facilitates the rapid development of self-supervised learning for GNNs in the chemical domain. In this work, we propose pretraining GNNs at the fragment level, a promising middle ground to overcome the limitations of node-level and graph-level pretraining. Borrowing techniques from recent work on principal subgraph mining, we obtain a compact vocabulary of prevalent fragments from a large pretraining dataset. From the extracted vocabulary, we introduce several fragment-based contrastive and predictive pretraining tasks. The contrastive learning task jointly pretrains two different GNNs: one on molecular graphs and the other on fragment graphs, which represent higher-order connectivity within molecules. By enforcing consistency between the fragment embedding and the aggregated embedding of the corresponding atoms from the molecular graphs, we ensure that the embeddings capture structural information at multiple resolutions. The structural information of fragment graphs is further exploited to extract auxiliary labels for graph-level predictive pretraining. We employ both the pretrained molecular-based and fragment-based GNNs for downstream prediction, thus utilizing the fragment information during finetuning. Our graph fragment-based pretraining (GraphFP) improves performance on 5 out of 8 common molecular benchmarks and improves performance on long-range biological benchmarks by at least 11.5%. Code is available at: https://github.com/lvkd84/GraphFP
What problem does this paper attempt to address?
The paper primarily aims to address the challenges faced when applying Graph Neural Networks (GNNs) for property prediction on molecular graphs, particularly how to effectively utilize large-scale unlabeled molecular data to improve the performance of GNNs. Specifically, the paper proposes a new framework called GraphFP, which overcomes the limitations of node-level and graph-level pretraining in existing methods through fragment-level pretraining.
### Main Issues Addressed
1. **Data Hunger Problem**: Although modern GNNs can efficiently represent chemical structures and achieve state-of-the-art performance in many tasks, these models often require a large amount of labeled data to avoid overfitting. However, in the chemical domain, available molecular datasets are usually small in scale, which limits the generalization ability of GNNs.
2. **Effective Design of Self-Supervised Learning**: With the emergence of large amounts of unlabeled molecular data, researchers have begun to explore self-supervised learning techniques to leverage this data. However, how to effectively design self-supervised pretraining tasks suitable for the chemical domain remains an open problem.
3. **Fragment-Level Representation Learning**: Existing pretraining methods focus on either the node level or the graph level, and each has its limitations. Node-level methods may capture only local patterns while ignoring higher-order structural arrangements; graph-level methods may overlook finer-grained details. Fragment-based pretraining is proposed as a middle ground that addresses both limitations.
### Solutions
- **Proposing the GraphFP Framework**: This is a novel contrastive pretraining framework that can capture both fine-grained and global patterns at the fragment level. For each molecule, GraphFP builds two views, a molecular graph and a fragment graph, and processes each with its own GNN. Pretraining enforces consistency between each fragment's embedding and the aggregated embedding of that fragment's atoms in the molecular graph.
- **Fragment Extraction and Representation**: To generate a vocabulary of molecular fragments, the paper borrows techniques from principal subgraph mining to extract prevalent fragments from a large pretraining dataset. This produces a compact and diverse fragment vocabulary in which each fragment appears frequently enough without sacrificing fragment size.
- **Contrastive and Predictive Pretraining Tasks**: The paper defines several pretraining tasks, including contrastive learning tasks and fragment-based predictive tasks. These tasks encourage the model to learn the structural information of molecular fragments, thereby enhancing the understanding of molecules.
- **Model Integration for Downstream Tasks**: Ultimately, GraphFP uses the molecular encoder and fragment encoder obtained from pretraining to make predictions for downstream tasks, fully leveraging the fragment information.
Through these methods, GraphFP improves performance on 5 of 8 common molecular benchmarks and improves performance on long-range biological benchmarks by at least 11.5%.
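To make the fragment-level consistency idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`mean_pool`, `cosine`, `info_nce`), the use of mean pooling as the aggregation, the InfoNCE-style loss, the temperature value, and the toy embeddings are all hypothetical choices. The summary only states that fragment embeddings are made consistent with the aggregated embeddings of their atoms.

```python
import math

def mean_pool(atom_embeddings, atom_to_fragment, num_fragments):
    """Average the atom embeddings that belong to each fragment (assumed aggregation)."""
    dim = len(atom_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_fragments)]
    counts = [0] * num_fragments
    for emb, frag in zip(atom_embeddings, atom_to_fragment):
        counts[frag] += 1
        for k, value in enumerate(emb):
            sums[frag][k] += value
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def info_nce(pooled, fragment_embeddings, temperature=0.1):
    """InfoNCE-style loss (assumed): pooled[i] should match fragment_embeddings[i];
    every other fragment embedding acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(pooled):
        logits = [cosine(anchor, f) / temperature for f in fragment_embeddings]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(pooled)

# Toy molecule: 4 atoms grouped into 2 fragments (hypothetical values).
atom_embeddings = [[1.0, 0.0], [1.0, 0.2], [0.0, 1.0], [0.2, 1.0]]
atom_to_fragment = [0, 0, 1, 1]
pooled = mean_pool(atom_embeddings, atom_to_fragment, num_fragments=2)

aligned = [[1.0, 0.0], [0.0, 1.0]]     # fragment-GNN outputs that agree with the atoms
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # same outputs assigned to the wrong fragments

loss_aligned = info_nce(pooled, aligned)
loss_misaligned = info_nce(pooled, misaligned)
```

In this toy setup the loss is small when each fragment embedding agrees with the pooled embedding of its own atoms and large when the pairing is swapped, which is the behaviour a consistency objective of this kind is meant to reward. In a real pipeline the two sets of embeddings would come from the molecular-graph GNN and the fragment-graph GNN, and negatives could be drawn from other molecules in the batch.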