Abstract: Graph classification, which aims to identify the category labels of graphs, plays a significant role in drug classification, toxicity detection, protein analysis, etc. However, the limited scale of the benchmark datasets makes it easy for graph classification models to fall into over-fitting and undergeneralization. To address this, we introduce data augmentation on graphs (i.e., graph augmentation) and present four methods: random mapping, vertex-similarity mapping, motif-random mapping and motif-similarity mapping, to generate more weakly labeled data for small-scale benchmark datasets via heuristic transformation of graph structures. Furthermore, we propose a generic model evolution framework, named M-Evolve, which combines graph augmentation, data filtration and model retraining to optimize pre-trained graph classifiers. Experiments on six benchmark datasets demonstrate that the proposed framework helps existing graph classification models alleviate over-fitting and undergeneralization when training on small-scale benchmark datasets, yielding an average accuracy improvement of 3-13% on graph classification tasks.
What problem does this paper attempt to address?
This paper addresses over-fitting and insufficient generalization in graph classification. Because existing benchmark datasets are small, graph classification models trained on them are prone to both problems. To counter this, the authors introduce graph augmentation and propose four augmentation methods: random mapping, vertex-similarity mapping, motif-random mapping, and motif-similarity mapping. These methods heuristically transform the graph structure to generate additional weakly labeled data, thereby enlarging small-scale datasets.
In addition, the authors propose a generic model evolution framework, M-Evolve, which combines graph augmentation, data filtration, and model retraining to optimize pre-trained graph classifiers. Experiments on six benchmark datasets show that M-Evolve improves the performance of existing graph classification models on small-scale benchmark datasets, with an average increase in classification accuracy of 3% to 13%.
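The three stages of the framework (augmentation, filtration, retraining) can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function `evolve`, its parameters, the callables it receives, the rule of keeping examples whose reliability is at least the threshold, and the choice to accumulate kept examples across rounds are all hypothetical.

```python
from typing import Callable, List, Tuple, TypeVar

Graph = TypeVar("Graph")
Model = TypeVar("Model")


def evolve(
    model: Model,
    train: List[Tuple[Graph, int]],
    augment: Callable[[List[Tuple[Graph, int]]], List[Tuple[Graph, int]]],
    reliability: Callable[[Model, Tuple[Graph, int]], float],
    threshold: float,
    retrain: Callable[[Model, List[Tuple[Graph, int]]], Model],
    rounds: int = 1,
) -> Tuple[Model, List[Tuple[Graph, int]]]:
    """Hypothetical augment -> filter -> retrain loop (all names are assumptions)."""
    data = list(train)
    for _ in range(rounds):
        # 1. Graph augmentation: produce weakly labeled candidates.
        pool = augment(data)
        # 2. Data filtration: keep candidates the current model finds reliable.
        #    (Assumption: "reliable" means reliability >= threshold.)
        kept = [ex for ex in pool if reliability(model, ex) >= threshold]
        # 3. Model retraining on the enlarged training set.
        #    (Assumption: kept examples accumulate from round to round.)
        data = data + kept
        model = retrain(model, data)
    return model, data
```

Passing the stages in as callables keeps the sketch independent of any particular classifier or augmentation method; in practice each callable would wrap a real model and one of the four mapping methods.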
### Summary of the core issues in the paper:
1. **Over-fitting and insufficient generalization**: Because existing graph classification datasets are small, models are prone to over-fitting and generalize poorly.
2. **Data augmentation**: Graph augmentation techniques generate additional weakly labeled data to enlarge the training set.
3. **Model optimization**: The M-Evolve framework combines graph augmentation, data filtration, and model retraining to optimize graph classification models.
### Markdown representation of formulas:
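- **Vertex similarity and weight calculation for adding edges** (here \(\Gamma(i)\) is the neighbor set of vertex \(v_i\), \(d_z\) is the degree of vertex \(z\), and \(E_c^{\text{add}}\) is the candidate set of edges to add):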
\[
s_{ij} = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{d_z}, \quad S = \{s_{ij} | \forall (v_i, v_j) \in E_c^{\text{add}}\}
\]
\[
w_{ij}^{\text{add}} = \frac{s_{ij}}{\sum_{s \in S} s}, \quad W^{\text{add}} = \{w_{ij}^{\text{add}} | \forall (v_i, v_j) \in E_c^{\text{add}}\}
\]
- **Weight calculation for deleting edges**:
\[
w_{ij}^{\text{del}} = 1 - \frac{s_{ij}}{\sum_{s \in S} s}, \quad W^{\text{del}} = \{w_{ij}^{\text{del}} | \forall (v_i, v_j) \in E_c^{\text{del}}\}
\]
- **Calculation of label reliability threshold**:
\[
\theta = \arg \min_{\theta} \sum_{(G_i, y_i) \in D_{\text{val}}} \Phi[(\theta - r_i) \cdot g(G_i, y_i)]
\]
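where \(r_i\) is the label reliability of the validation example \((G_i, y_i)\), \(C\) is the classifier, and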
\[
g(G_i, y_i) =
\begin{cases}
1 & \text{if } C(G_i) = y_i \\
-1 & \text{otherwise}
\end{cases}
\]
\[
\Phi(x) =
\begin{cases}
1 & \text{if } x > 0 \\
0 & \text{otherwise}
\end{cases}
\]
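The formulas above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the paper's code: the adjacency-dictionary representation, the names `similarity`, `sampling_weights`, and `reliability_threshold`, and the tie-breaking rule for the threshold are all hypothetical. The source defines \(S\) only over \(E_c^{\text{add}}\); the sketch assumes \(S\) is taken over whichever candidate set is passed in.

```python
from typing import Dict, Hashable, List, Sequence, Set, Tuple

Node = Hashable
Adjacency = Dict[Node, Set[Node]]


def similarity(adj: Adjacency, i: Node, j: Node) -> float:
    """s_ij: sum of 1/d_z over the common neighbors z of i and j."""
    return sum(1.0 / len(adj[z]) for z in adj[i] & adj[j])


def sampling_weights(
    adj: Adjacency, candidates: Sequence[Tuple[Node, Node]], mode: str
) -> List[float]:
    """Return w^add (mode='add') or w^del (mode='del') for the candidate pairs."""
    scores = [similarity(adj, i, j) for i, j in candidates]
    total = sum(scores)
    if total == 0:
        raise ValueError("all candidate similarities are zero")
    if mode == "add":
        # Higher similarity -> more likely to be added; weights sum to 1.
        return [s / total for s in scores]
    if mode == "del":
        # Higher similarity -> less likely to be deleted. These weights sum
        # to len(candidates) - 1, so normalize them before sampling.
        return [1.0 - s / total for s in scores]
    raise ValueError("mode must be 'add' or 'del'")


def reliability_threshold(
    reliabilities: Sequence[float], correct: Sequence[bool]
) -> float:
    """Pick theta minimizing sum_i Phi[(theta - r_i) * g_i] on validation data."""

    def loss(theta: float) -> int:
        # g_i is +1 for a correctly classified example and -1 otherwise;
        # Phi counts the terms that are strictly positive.
        return sum(
            1
            for r, ok in zip(reliabilities, correct)
            if (theta - r) * (1 if ok else -1) > 0
        )

    # The loss only changes at the observed r_i values, and it is never
    # larger at an r_i than just beside it, so searching them is enough.
    # Assumption: ties are broken in favor of the smallest theta.
    return min(sorted(set(reliabilities)), key=loss)
```

For example, on a triangle `0-1-2` with a pendant vertex `3` attached to `2`, the two candidate additions `(0, 3)` and `(1, 3)` share the single common neighbor `2` and therefore receive equal weight.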
Through these methods, the paper alleviates over-fitting and insufficient generalization in graph classification on small-scale benchmark datasets and reports an average accuracy improvement of 3% to 13%.