Two Trades is not Baffled: Condensing Graph via Crafting Rational Gradient Matching

Tianle Zhang, Yuchen Zhang, Kun Wang, Kai Wang, Beining Yang, Kaipeng Zhang, Wenqi Shao, Ping Liu, Joey Tianyi Zhou, Yang You
2024-09-27
Abstract: Training on large-scale graphs has achieved remarkable results in graph representation learning, but its cost and storage have raised growing concerns. As one of the most promising directions, graph condensation methods address these issues by employing gradient matching, aiming to condense the full graph into a more concise yet information-rich synthetic set. Though encouraging, these strategies primarily emphasize matching directions of the gradients, which leads to deviations in the training trajectories. Such deviations are further magnified by the differences between the condensation and evaluation phases, culminating in accumulated errors, which detrimentally affect the performance of the condensed graphs. In light of this, we propose a novel graph condensation method named CrafTing RationaL trajectory (CTRL), which offers an optimized starting point closer to the original dataset's feature distribution and a more refined strategy for gradient matching. Theoretically, CTRL can effectively neutralize the impact of accumulated errors on the performance of condensed graphs. We provide extensive experiments on various graph datasets and downstream tasks to support the effectiveness of CTRL. Code is released at https://github.com/NUS-HPC-AI-Lab/CTRL.
Machine Learning
What problem does this paper attempt to address?
The paper targets the high computational and storage cost of training graph neural networks (GNNs) on large-scale graphs. Existing graph condensation methods compress the graph through gradient matching, but they mainly match gradient directions and leave gradient magnitudes unmatched, so the training trajectory on the condensed graph deviates from the trajectory on the original graph. Differences between the condensation stage and the evaluation stage amplify these deviations, and the resulting cumulative errors degrade the performance of the condensed graph.

To address these problems, the authors propose a new graph condensation method, **CrafTing RationaL trajectory (CTRL)**. The method chooses an initial point closer to the feature distribution of the original dataset and adopts a more refined gradient-matching strategy, aiming to reduce the matching error and alleviate the impact of cumulative errors on the performance of the condensed graph.

### Main contributions:

1. **Introduction of CTRL**: Building on existing graph condensation research, especially gradient-matching techniques, the paper proposes a simple and highly general method for condensing graph datasets.
2. **Optimization of the training trajectory of synthetic data**: A weighted combination of cosine distance and Euclidean distance brings the optimization trajectory of the synthetic data closer to that of the real data and captures the feature distribution of the real data.
3. **Extensive experimental verification**: CTRL is evaluated on 18 node-classification tasks and 18 graph-classification tasks. It achieves state-of-the-art performance in 34 experiments on 12 datasets and lossless performance on 5 datasets.
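The weighted combination of cosine distance and Euclidean distance named in the contributions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the weight `beta`, and the convention that `beta` scales the cosine term are all hypothetical, and gradients are treated as flat Python lists.

```python
import math


def cosine_distance(g_real, g_syn, eps=1e-12):
    """Return 1 - cosine similarity; sensitive to direction only."""
    dot = sum(a * b for a, b in zip(g_real, g_syn))
    norm_real = math.sqrt(sum(a * a for a in g_real))
    norm_syn = math.sqrt(sum(b * b for b in g_syn))
    # eps guards against division by zero for an all-zero gradient.
    return 1.0 - dot / (norm_real * norm_syn + eps)


def euclidean_distance(g_real, g_syn):
    """Return the L2 distance; sensitive to magnitude as well as direction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g_real, g_syn)))


def matching_distance(g_real, g_syn, beta=0.5):
    """Weighted combination of both distances (beta is a hypothetical weight)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (beta * cosine_distance(g_real, g_syn)
            + (1.0 - beta) * euclidean_distance(g_real, g_syn))


if __name__ == "__main__":
    g_real = [3.0, 4.0]  # toy gradient from the original graph
    g_syn = [0.3, 0.4]   # same direction, ten times smaller
    print(cosine_distance(g_real, g_syn))
    print(euclidean_distance(g_real, g_syn))
    print(matching_distance(g_real, g_syn))
```

In the toy example the two gradients point in the same direction, so the cosine term is close to zero, while the Euclidean term still registers the gap in magnitude. That gap is what a direction-only criterion would miss.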
### Specific problem descriptions:

- **High cost and storage issues**: Training on large-scale graph data requires substantial computing resources and storage space.
- **Biases in gradient matching**: Existing methods mainly match gradient directions and ignore differences in gradient magnitudes, which biases the training trajectory.
- **Cumulative errors**: Differences between the condensation stage and the evaluation stage lead to cumulative errors that degrade the performance of the condensed graph.

### Solutions:

- **Combination of cosine distance and Euclidean distance**: A linear combination of cosine distance and Euclidean distance matches both the direction and the magnitude of the gradients.
- **Optimization of the initial point**: The original data is clustered and samples are drawn from each sub-cluster, so that the feature distribution of the initial synthetic data is closer to that of the original data.
- **Theoretical analysis**: By defining cumulative error, matching error, and initialization error, the authors provide a theoretical analysis supporting the effectiveness of CTRL.

Through these improvements, CTRL improves the quality of the synthetic graphs and reduces the impact of cumulative errors on model performance, offering a new path toward efficient training and storage of large-scale graph data.
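The initialization idea (cluster the original data, then sample from each sub-cluster) might look roughly like the sketch below. The summary does not say which clustering algorithm is used, so plain k-means is an assumption here, as are the per-class grouping, the function names, and the `per_class` parameter. Node features are plain Python lists, and graph structure is ignored.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns a cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def init_synthetic_features(features, labels, per_class, seed=0):
    """Split each class into sub-clusters and draw one real node from each."""
    rng = random.Random(seed)
    synthetic = []
    for cls in sorted(set(labels)):
        class_points = [f for f, y in zip(features, labels) if y == cls]
        k = min(per_class, len(class_points))
        assignment = kmeans(class_points, k, seed=seed)
        for c in range(k):
            members = [p for p, a in zip(class_points, assignment) if a == c]
            if members:
                synthetic.append((list(rng.choice(members)), cls))
    return synthetic


if __name__ == "__main__":
    # Toy data: each class consists of two well-separated groups of nodes.
    features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10],
                [0, 20], [0, 21], [1, 20], [10, 30], [10, 31], [11, 30]]
    labels = [0] * 6 + [1] * 6
    for feature, label in init_synthetic_features(features, labels, per_class=2):
        print(label, feature)
```

Compared with sampling uniformly at random, drawing one node per sub-cluster spreads the starting points across the regions of each class, which is the sense in which the initial synthetic features sit closer to the original feature distribution.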