Malicious Repositories Detection with Adversarial Heterogeneous Graph Contrastive Learning

Yiyue Qian, Yiming Zhang, Nitesh Chawla, Yanfang Ye, Chuxu Zhang
DOI: https://doi.org/10.1145/3511808.3557384
2022-01-01
Abstract: GitHub, as the largest social coding platform, has attracted an increasing number of cybercriminals who disseminate malware by posting malicious code repositories. To address this pressing problem, some tools have been developed to detect malicious repositories based on their code content. However, most of them ignore the rich relational information among repositories and usually require abundant labeled data to train the model. One effective remedy is to exploit unlabeled data to pre-train a model that considers both the structural relations and the code content of repositories, and then to transfer the pre-trained model to downstream tasks with labeled repository data. In this paper, we propose a novel model, adversarial contrastive learning on heterogeneous graph (CLA-HG), to detect malicious repositories on GitHub. First, CLA-HG builds a heterogeneous graph (HG) to comprehensively model repository data. Then, to exploit the unlabeled information in the HG, CLA-HG introduces a dual-stream graph contrastive learning mechanism that distinguishes both adversarial subgraph pairs and standard subgraph pairs to pre-train graph neural networks on unlabeled data. Finally, the pre-trained model is fine-tuned for the downstream malicious repository detection task, enhanced by a knowledge distillation (KD) module. Extensive experiments on two datasets collected from GitHub demonstrate the effectiveness of CLA-HG in comparison with state-of-the-art methods and popular commercial anti-malware products.
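The dual-stream contrastive objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact loss, so this example assumes an InfoNCE-style loss, a common choice in contrastive learning, and simply averages one term over standard subgraph pairs with one over adversarial subgraph pairs. The function names, the `temperature` and `weight` parameters, and the use of precomputed embeddings in place of a graph neural network are all hypothetical and are not taken from the paper.

```python
import numpy as np


def info_nce_loss(anchors, positives, temperature=0.5):
    """InfoNCE-style loss (assumed here, not confirmed by the abstract).

    Row i of `anchors` is treated as matching row i of `positives`;
    every other row of `positives` acts as a negative for it.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # shape (n, n)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


def dual_stream_loss(z, z_standard, z_adversarial, weight=0.5):
    """Hypothetical combination of the two streams.

    `z_standard` holds embeddings of standard subgraph views and
    `z_adversarial` holds embeddings of adversarially perturbed views;
    `weight` balances the two terms (an illustrative parameter).
    """
    standard_term = info_nce_loss(z, z_standard)
    adversarial_term = info_nce_loss(z, z_adversarial)
    return weight * standard_term + (1.0 - weight) * adversarial_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 16))                  # stand-in node embeddings
    z_std = z + 0.05 * rng.normal(size=z.shape)   # mild, random perturbation
    z_adv = z + 0.50 * rng.normal(size=z.shape)   # stronger perturbation
    print(dual_stream_loss(z, z_std, z_adv))
```

In this sketch the loss is small when each embedding is closest to its own augmented view and grows when views are mismatched, which is the property a pre-training stage on unlabeled subgraph pairs would rely on.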