Rep2Vec: Repository Embedding Via Heterogeneous Graph Adversarial Contrastive Learning

Yiyue Qian,Yiming Zhang,Qianlong Wen,Yanfang Ye,Chuxu Zhang
DOI: https://doi.org/10.1145/3534678.3539324
2022-01-01
Abstract:Driven by the exponential increase of software and the advent of the pull-based development system Git, a large amount of open-source software has emerged on various social coding platforms. GitHub, as the largest platform, not only attracts developers and researchers to contribute legitimate software and research-related source code but has also become a popular platform for an increasing number of cybercriminals to perform continuous cyberattacks. Hence, some tools have been developed to learn representations of repositories on GitHub for various related applications (e.g., malicious repository detection) recently. However, most of them merely focus on code content while ignoring the rich relational data among repositories. In addition, they usually require a mass of resources to obtain sufficient labeled data for model training while ignoring the usefully handy unlabeled data. To this end, we propose a novel model Rep2Vec which integrates the code content, the structural relations, and the unlabeled data to learn the repository representations. First, to comprehensively model the repository data, we build a repository heterogeneous graph (Rep-HG) which is encoded by a graph neural network. Afterwards, to fully exploit unlabeled data in Rep-HG, we introduce adversarial attacks to generate more challenging contrastive pairs for the contrastive learning module to train the encoder in node view and meta-path view simultaneously. To alleviate the workload of the encoder against attacks, we further design a dual-stream contrastive learning module that integrates contrastive learning on adversarial graph and original graph together. Finally, the pre-trained encoder is fine-tuned to the downstream task, and further enhanced by a knowledge distillation module. Extensive experiments on the collected dataset from GitHub demonstrate the effectiveness of Rep2Vec in comparison with state-of-the-art methods for multiple repository tasks.
What problem does this paper attempt to address?