Data Augmentation for Supervised Graph Outlier Detection via Latent Diffusion Models

Kay Liu, Hengrui Zhang, Ziqing Hu, Fangxin Wang, Philip S. Yu
2024-11-23
Abstract: A fundamental challenge confronting supervised graph outlier detection algorithms is the prevalent problem of class imbalance, where the scarcity of outlier instances compared to normal instances often results in suboptimal performance. Recently, generative models, especially diffusion models, have demonstrated their efficacy in synthesizing high-fidelity images. Despite their extraordinary generation quality, their potential in data augmentation for supervised graph outlier detection remains largely underexplored. To bridge this gap, we introduce GODM, a novel data augmentation for mitigating class imbalance in supervised Graph Outlier detection via latent Diffusion Models. Extensive experiments conducted on multiple datasets substantiate the effectiveness and efficiency of GODM. The case study further demonstrated the generation quality of our synthetic data. To foster accessibility and reproducibility, we encapsulate GODM into a plug-and-play package and release it at PyPI: https://pypi.org/project/godm/.
Machine Learning, Social and Information Networks
What problem does this paper attempt to address?
The paper addresses the class imbalance problem in supervised graph outlier detection. Specifically:

- **Class Imbalance Problem**: In supervised graph outlier detection, normal instances (inliers) far outnumber abnormal instances (outliers). This imbalance biases the model toward normal instances during training and reduces detection performance on abnormal instances. For example, in the DGraph dataset, the ratio of positive to negative samples is only 1:85, which reflects the extreme imbalance found in real-world scenarios such as financial fraud detection.
- **Limitations of Existing Methods**:
  - **Upsampling and Downsampling**: These methods ease the imbalance by replicating the minority class or by discarding part of the majority class, respectively, but they risk overfitting or losing valuable training data.
  - **Instance Reweighting in the Loss Function**: This approach assigns larger weights to abnormal instances in the loss function but, like upsampling and downsampling, it does not fully resolve the problem.

To address these limitations, the paper introduces GODM (Graph Outlier detection via latent Diffusion Models), a data augmentation method based on latent diffusion models (LDM). GODM generates synthetic abnormal instances to balance the class distribution in the training data, thereby improving the performance of graph outlier detectors.

### Main Contributions of GODM

1. **Generate High-Quality Synthetic Data**: GODM performs data augmentation in the latent space with a diffusion model to generate realistic abnormal nodes.
2. **Handle Heterogeneous Graph Data**: To cope with the complexity and heterogeneity of graph data, the paper proposes a variational encoder that maps different types of graph information into a unified latent space.
3. **Improve Computational Efficiency**: Negative sampling and graph clustering reduce computational cost, enabling GODM to run efficiently on large-scale graph data.
4. **Conditional Generation**: Generation is conditioned so that only abnormal nodes are produced, ensuring that the generated data meets the requirements of the task.

### Experimental Results

The paper verifies the effectiveness and efficiency of GODM through experiments on multiple datasets, with strong results on metrics such as AUC, AP, and Recall.

In conclusion, GODM provides an effective way to alleviate class imbalance in supervised graph outlier detection and thereby improve detection performance.
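
To make the baseline remedies discussed above concrete, the following is a minimal sketch of instance reweighting and random upsampling on toy labels at the 1:85 ratio cited for DGraph. It is not code from the paper or from the `godm` package: the function names `inverse_frequency_weights` and `upsample_minority` are hypothetical, the labels are synthetic, and the weighting formula is the common inverse-frequency heuristic rather than anything the paper specifies.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class loss weights: n_samples / (n_classes * class_count).

    Assumption: the usual "balanced" heuristic, not the paper's scheme.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def upsample_minority(labels, minority=1, seed=0):
    """Return sample indices with minority indices replicated
    (drawn with replacement) until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n_extra = max(0, len(majority_idx) - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return majority_idx + minority_idx + extra


# Toy labels only: 10 outliers (1) and 850 inliers (0), i.e. a 1:85 ratio.
labels = [1] * 10 + [0] * 850

weights = inverse_frequency_weights(labels)
balanced = upsample_minority(labels)

print(weights[1] / weights[0])               # relative weight of an outlier
print(Counter(labels[i] for i in balanced))  # class counts after upsampling
```

Note that the upsampled set contains only repeated copies of the same 10 outliers, which is the overfitting risk mentioned above. Generating new synthetic outliers, as GODM does, is meant to avoid exactly that repetition.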