M-Evolve: Structural-Mapping-Based Data Augmentation for Graph Classification

Jiajun Zhou, Jie Shen, Shanqing Yu, Guanrong Chen, Qi Xuan
DOI: https://doi.org/10.1109/TNSE.2020.3032950
2021-04-03
Abstract: Graph classification, which aims to identify the category labels of graphs, plays a significant role in drug classification, toxicity detection, protein analysis, etc. However, the limitation of scale in the benchmark datasets makes it easy for graph classification models to fall into over-fitting and undergeneralization. To improve this, we introduce data augmentation on graphs (i.e., graph augmentation) and present four methods: random mapping, vertex-similarity mapping, motif-random mapping and motif-similarity mapping, to generate more weakly labeled data for small-scale benchmark datasets via heuristic transformation of graph structures. Furthermore, we propose a generic model evolution framework, named M-Evolve, which combines graph augmentation, data filtration and model retraining to optimize pre-trained graph classifiers. Experiments on six benchmark datasets demonstrate that the proposed framework helps existing graph classification models alleviate over-fitting and undergeneralization in the training on small-scale benchmark datasets, which successfully yields an average improvement of 3-13% accuracy on graph classification tasks.
Machine Learning, Social and Information Networks
What problem does this paper attempt to address?
This paper addresses over-fitting and insufficient generalization in graph classification. Because existing benchmark datasets are small, graph classification models trained on them tend to over-fit and generalize poorly. To mitigate this, the authors introduce graph augmentation and propose four augmentation methods: random mapping, vertex-similarity mapping, motif-random mapping, and motif-similarity mapping. These methods heuristically modify the graph structure to generate additional weakly labeled data, thereby expanding small-scale datasets. The authors also propose a general model evolution framework, M-Evolve, which combines graph augmentation, data filtering, and model retraining to optimize pre-trained graph classifiers. Experiments on six benchmark datasets show that M-Evolve improves the performance of existing graph classification models on small-scale benchmark datasets, with an average gain in classification accuracy of 3% to 13%.

### Summary of the core issues in the paper:

1. **Over-fitting and insufficient generalization**: Because existing graph classification datasets are small, models are prone to over-fitting and generalize poorly.
2. **Data augmentation**: Graph augmentation generates additional weakly labeled data to expand the training dataset.
3. **Model optimization**: The M-Evolve framework combines graph augmentation, data filtering, and model retraining to optimize graph classification models.
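The three stages that M-Evolve combines (graph augmentation, data filtering, model retraining) can be sketched as one round of a loop. This is only an illustrative outline, not the authors' implementation: `evolve_once`, `augment`, `reliability`, `theta`, and `retrain` are hypothetical names, and the rule "keep a candidate when its reliability exceeds `theta`" is an assumption about how the filtering step uses a threshold.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

G = TypeVar("G")  # any graph representation
M = TypeVar("M")  # any model type


def evolve_once(
    train: Sequence[Tuple[G, int]],
    augment: Callable[[G], List[G]],        # hypothetical: graph -> augmented graphs
    reliability: Callable[[G, int], float],  # hypothetical: label reliability score
    theta: float,                            # hypothetical: filtering threshold
    retrain: Callable[[List[Tuple[G, int]]], M],
) -> Tuple[M, List[Tuple[G, int]]]:
    """One sketch round: augment, filter, retrain."""
    # 1. Augmentation: every generated graph inherits its source graph's label,
    #    which is why the new examples are only weakly labeled.
    candidates = [(g_new, y) for g, y in train for g_new in augment(g)]
    # 2. Filtering (assumed rule): keep candidates whose reliability exceeds theta.
    kept = [(g, y) for g, y in candidates if reliability(g, y) > theta]
    # 3. Retraining on the original data plus the filtered candidates.
    model = retrain(list(train) + kept)
    return model, kept


# Toy stand-ins so the sketch runs end to end; none of this is real graph data.
toy_train = [("g1", 0), ("g2", 1)]
toy_scores = {"g1_a": 0.9, "g1_b": 0.2, "g2_a": 0.6, "g2_b": 0.4}
model, kept = evolve_once(
    toy_train,
    augment=lambda g: [g + "_a", g + "_b"],
    reliability=lambda g, y: toy_scores[g],
    theta=0.5,
    retrain=lambda data: len(data),  # the "model" is just the training-set size
)
```

In practice `augment` would be one of the four mapping methods and `retrain` would fine-tune the pre-trained classifier; the callables are kept abstract here because the summary does not specify them.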
### Markdown representation of formulas:

- **Vertex similarity calculation**:

\[ s_{ij} = \sum_{z \in \Gamma(i) \cap \Gamma(j)} \frac{1}{d_z}, \quad S = \{ s_{ij} \mid \forall (v_i, v_j) \in E_c^{\text{add}} \} \]

\[ w_{ij}^{\text{add}} = \frac{s_{ij}}{\sum_{s \in S} s}, \quad W^{\text{add}} = \{ w_{ij}^{\text{add}} \mid \forall (v_i, v_j) \in E_c^{\text{add}} \} \]

Here \(\Gamma(i)\) is the set of neighbors of vertex \(v_i\), \(d_z\) is the degree of vertex \(v_z\), and \(E_c^{\text{add}}\) is the candidate set of edges to add.

- **Weight calculation for deleting edges**:

\[ w_{ij}^{\text{del}} = 1 - \frac{s_{ij}}{\sum_{s \in S} s}, \quad W^{\text{del}} = \{ w_{ij}^{\text{del}} \mid \forall (v_i, v_j) \in E_c^{\text{del}} \} \]

where \(E_c^{\text{del}}\) is the candidate set of edges to delete.

- **Calculation of label reliability threshold**:

\[ \theta = \arg \min_{\theta} \sum_{(G_i, y_i) \in D_{\text{val}}} \Phi[(\theta - r_i) \cdot g(G_i, y_i)] \]

where \(r_i\) is the label reliability of \((G_i, y_i)\), \(C\) is the classifier, \(D_{\text{val}}\) is the validation set, and

\[ g(G_i, y_i) = \begin{cases} 1 & \text{if } C(G_i) = y_i \\ -1 & \text{otherwise} \end{cases} \]

\[ \Phi(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \]

Through these methods, the paper alleviates over-fitting and insufficient generalization in graph classification and improves classification accuracy on small-scale benchmark datasets.
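The vertex-similarity weights and the threshold search above can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the paper's code: the function names are hypothetical, graphs are assumed to be undirected and stored as a dict of neighbor sets, the uniform fallback for an all-zero score set is an added assumption, and the threshold is searched only over the observed reliability values. That restriction loses nothing, because the objective is piecewise constant in \(\theta\) and is never larger at an observed \(r_i\) than on the intervals next to it.

```python
from typing import Dict, Hashable, Sequence, Set, Tuple

Node = Hashable
Pair = Tuple[Node, Node]


def similarity(adj: Dict[Node, Set[Node]], i: Node, j: Node) -> float:
    """s_ij: sum of 1/d_z over the common neighbors z of i and j."""
    return sum(1.0 / len(adj[z]) for z in adj[i] & adj[j])


def sampling_weights(
    adj: Dict[Node, Set[Node]], candidates: Sequence[Pair], mode: str
) -> Dict[Pair, float]:
    """mode="add": s_ij / sum(S); any other mode: 1 - s_ij / sum(S)."""
    scores = {pair: similarity(adj, *pair) for pair in candidates}
    total = sum(scores.values())
    if total == 0:
        # Assumption (not from the source): uniform weights when every score is 0.
        return {pair: 1.0 / len(scores) for pair in scores}
    if mode == "add":
        return {pair: s / total for pair, s in scores.items()}
    return {pair: 1.0 - s / total for pair, s in scores.items()}


def select_threshold(reliabilities: Sequence[float], correct: Sequence[bool]) -> float:
    """Pick theta minimizing the number of validation graphs on the wrong side."""

    def cost(theta: float) -> int:
        # Phi[(theta - r_i) * g_i] is 1 when a correctly classified graph lies
        # below theta or a misclassified graph lies above it.
        return sum(
            1
            for r, ok in zip(reliabilities, correct)
            if (theta - r) * (1 if ok else -1) > 0
        )

    # Ties are broken toward the smallest candidate (an arbitrary choice here).
    return min(sorted(set(reliabilities)), key=cost)


# Toy graph with edges 0-1, 0-2, 1-2, 2-3, 1-4, 3-4.
adj = {0: {1, 2}, 1: {0, 2, 4}, 2: {0, 1, 3}, 3: {2, 4}, 4: {1, 3}}
w_add = sampling_weights(adj, [(0, 3), (0, 4), (1, 3), (2, 4)], mode="add")
w_del = sampling_weights(adj, [(0, 1), (2, 3)], mode="del")
theta = select_threshold([0.2, 0.4, 0.6, 0.9], [False, False, True, True])
```

Note that the deletion weights are not normalized to sum to one; if they are used for sampling, a routine that accepts relative weights (such as `random.choices`) or an explicit normalization step would be needed.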