Abstract: With the increasing computation of training graph neural networks (GNNs) on large-scale graphs, graph condensation (GC) has emerged as a promising solution to synthesize a compact, substitute graph of the large-scale original graph for efficient GNN training. However, existing GC methods predominantly employ classification as the surrogate task for optimization, thus excessively relying on node labels and constraining their utility in label-sparsity scenarios. More critically, this surrogate task tends to overfit class-specific information within the condensed graph, consequently restricting the generalization capabilities of GC for other downstream tasks. To address these challenges, we introduce Contrastive Graph Condensation (CTGC), which adopts a self-supervised surrogate task to extract critical, causal information from the original graph and enhance the cross-task generalizability of the condensed graph. Specifically, CTGC employs a dual-branch framework to disentangle the generation of the node attributes and graph structures, where a dedicated structural branch is designed to explicitly encode geometric information through nodes' positional embeddings. By implementing an alternating optimization scheme with contrastive loss terms, CTGC promotes the mutual enhancement of both branches and facilitates high-quality graph generation through the model inversion technique. Extensive experiments demonstrate that CTGC excels in handling various downstream tasks with a limited number of labels, consistently outperforming state-of-the-art GC methods.
What problem does this paper attempt to address?
The paper asks how to condense a large-scale graph into a small synthetic graph for training graph neural networks (GNNs), so that training costs less computation while the resulting models keep or improve their generalization across downstream tasks. Existing graph condensation (GC) methods mainly use classification as the surrogate task for optimization, which causes two problems:
1. **Label Dependence**: Existing GC methods rely heavily on node labels, yet labels are often scarce in real-world large-scale graphs. This dependence limits where GC methods can be applied.
2. **Limited Generalization Ability**: Optimizing for classification tends to overfit class-specific information into the condensed graph, which limits its usefulness for other downstream tasks.
To address these problems, the paper proposes Contrastive Graph Condensation (CTGC), a self-supervised framework. CTGC uses a contrastive surrogate task to extract critical, causal information from the original graph and to improve the cross-task generalizability of the condensed graph. Its main components are:
- **Two-Branch Framework**: CTGC uses two branches to disentangle the generation of node attributes from that of graph structure. The semantic branch handles node attributes, while the structural branch explicitly encodes geometric information through the nodes' positional embeddings.
- **Alternating Optimization Mechanism**: CTGC optimizes the semantic and structural branches in alternation, so that each branch improves the other, and generates the condensed graph through model inversion.
- **Contrastive Loss Function**: Contrastive loss terms let CTGC condense the key information of the graph without relying on labels, which improves generalization across tasks.
Through these improvements, CTGC not only reduces the dependence on labels, but also significantly improves the performance of the condensed graph in various downstream tasks, especially in the case of label scarcity.
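The alternating scheme described above can be illustrated with a generic toy sketch. This is not CTGC's actual objective: `toy_loss`, the two parameter blocks `x` ("semantic") and `p` ("structural"), the targets, and the `coupling` term are all hypothetical stand-ins chosen so the example stays small. It only shows the control flow of updating one block while the other is held fixed.

```python
import numpy as np

def toy_loss(x, p, target_x, target_p, coupling=0.5):
    """Hypothetical stand-in objective; NOT the CTGC contrastive loss."""
    return float(np.sum((x - target_x) ** 2)
                 + np.sum((p - target_p) ** 2)
                 + coupling * np.sum((x - p) ** 2))

def alternating_minimize(x, p, target_x, target_p,
                         coupling=0.5, lr=0.1, rounds=50):
    """Alternate gradient steps on two parameter blocks."""
    history = [toy_loss(x, p, target_x, target_p, coupling)]
    for _ in range(rounds):
        # Step 1: update the "semantic" block x while p is frozen.
        grad_x = 2 * (x - target_x) + 2 * coupling * (x - p)
        x = x - lr * grad_x
        # Step 2: update the "structural" block p while the new x is frozen.
        grad_p = 2 * (p - target_p) - 2 * coupling * (x - p)
        p = p - lr * grad_p
        history.append(toy_loss(x, p, target_x, target_p, coupling))
    return x, p, history

rng = np.random.default_rng(0)
x0, p0 = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
tx, tp = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
x_opt, p_opt, history = alternating_minimize(x0, p0, tx, tp)
```

Because the two blocks are coupled through the shared term, each update changes the landscape seen by the other block, which is the sense in which alternating updates let the branches inform each other.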
### Formula Summary
- **Graph Convolution Operation**:
\[
H^{(k)}=\text{ReLU}\left(\hat{A}H^{(k - 1)}W^{(k)}\right)
\]
where \(H^{(k)}\) is the node embedding matrix at the \(k\)-th layer, \(\hat{A} = D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\) is the normalized adjacency matrix (with \(D\) the diagonal degree matrix of \(A\)), and \(W^{(k)}\) is the trainable weight matrix.
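A minimal NumPy sketch of this layer, following the formula as written above. The function names, the tiny path graph, and the zero-degree guard for isolated nodes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def normalize_adjacency(adj):
    """Compute A_hat = D^{-1/2} A D^{-1/2} for a dense adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0                      # assumption: isolated nodes get 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return inv_sqrt[:, None] * adj * inv_sqrt[None, :]

def gcn_layer(a_hat, h_prev, weight):
    """One layer: H^(k) = ReLU(A_hat @ H^(k-1) @ W^(k))."""
    return np.maximum(a_hat @ h_prev @ weight, 0.0)

# Path graph 0 - 1 - 2 with 2-dimensional input features.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
rng = np.random.default_rng(0)
h0 = rng.normal(size=(3, 2))
w1 = rng.normal(size=(2, 4))

a_hat = normalize_adjacency(adj)
h1 = gcn_layer(a_hat, h0, w1)   # shape (3, 4), all entries >= 0
```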
- **Contrastive Loss Function**:
\[
L_{\text{clu}}(H, H', y^H)=-\sum_{i = 0}^{N}\log\frac{\exp(\text{sim}(H_i, H'_{y^H_i})/\tau)}{\sum_{j = 0}^{N'}[j\neq y^H_i]\exp(\text{sim}(H_i, H'_j)/\tau)}
\]
\[
L_{\text{cen}}(H')=-\sum_{i = 0}^{N'}\log\frac{\exp(\text{sim}(H'_i, H'_i)/\tau)}{\sum_{j = 0}^{N'}[j\neq i]\exp(\text{sim}(H'_i, H'_j)/\tau)}
\]
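where \(\text{sim}(\cdot,\cdot)\) is cosine similarity, \(\tau\) is the temperature parameter, and \([\cdot]\) is the Iverson bracket (1 when the condition holds, 0 otherwise). \(H\) and \(H'\) are the two embedding matrices being contrasted, and \(y^H_i\) is the index of the row of \(H'\) that serves as the positive for \(H_i\).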
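The two loss terms can be sketched in NumPy as follows. This is a direct transcription of the formulas as written above, not the authors' implementation: the function names and the default `tau=0.5` are assumptions, and the sums here run over every row of the input matrices. `loss_cen` needs at least two rows in `h_prime`, since otherwise the masked denominator is empty.

```python
import numpy as np

def cosine_sim_matrix(a, b, eps=1e-12):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_n = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_n = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return a_n @ b_n.T

def loss_clu(h, h_prime, y, tau=0.5):
    """L_clu: pull H_i toward H'_{y_i}, contrast against all other rows of H'."""
    logits = cosine_sim_matrix(h, h_prime) / tau        # shape (N, N')
    rows = np.arange(h.shape[0])
    positive = logits[rows, y]
    mask = np.ones_like(logits, dtype=bool)
    mask[rows, y] = False                               # Iverson bracket [j != y_i]
    negative = np.log(np.sum(np.exp(logits) * mask, axis=1))
    return float(-np.sum(positive - negative))

def loss_cen(h_prime, tau=0.5):
    """L_cen: contrast each row of H' against every other row of H'."""
    logits = cosine_sim_matrix(h_prime, h_prime) / tau  # shape (N', N')
    positive = np.diag(logits)                          # sim(H'_i, H'_i) / tau
    mask = ~np.eye(h_prime.shape[0], dtype=bool)        # Iverson bracket [j != i]
    negative = np.log(np.sum(np.exp(logits) * mask, axis=1))
    return float(-np.sum(positive - negative))

# Orthonormal embeddings make the values easy to check by hand.
h = np.eye(3)
h_prime = np.eye(3)
aligned = loss_clu(h, h_prime, np.array([0, 1, 2]), tau=1.0)
shuffled = loss_clu(h, h_prime, np.array([1, 2, 0]), tau=1.0)
centre = loss_cen(h_prime, tau=1.0)
```

With these inputs the aligned assignment yields a lower `loss_clu` than the shuffled one, which is the behavior the formula asks for.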
- **Joint Loss Function**:
\[
L(H, H', y^H)=L_{\text{clu}}(H, H', y^H)+\alpha L_{\text{cen}}(H')
\]
where \(\alpha\) is the weight to balance the two loss terms.
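One observation follows directly from the formulas as written, assuming every \(H'_i\) is nonzero: cosine similarity of a vector with itself is 1, so the numerator of \(L_{\text{cen}}\) is the constant \(\exp(1/\tau)\). Writing \(M\) for the number of indices \(i\) in the outer sum (a symbol introduced here for convenience), the term simplifies to
\[
L_{\text{cen}}(H')=\sum_{i = 0}^{N'}\log\sum_{j = 0}^{N'}[j\neq i]\exp(\text{sim}(H'_i, H'_j)/\tau)-\frac{M}{\tau}
\]
The last term does not depend on \(H'\), so minimizing \(L_{\text{cen}}\) only lowers the pairwise similarities between distinct rows of \(H'\), and \(\alpha\) controls how strongly this spreading effect is weighted against \(L_{\text{clu}}\).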
Together, these loss terms allow CTGC to condense a graph without node labels, which addresses the label dependence and limited cross-task generalization of existing GC methods.