Abstract: Driven by the exponential growth of software and the advent of the pull-based development system Git, a large amount of open-source software has emerged on various social coding platforms. GitHub, the largest of these platforms, not only attracts developers and researchers who contribute legitimate software and research-related source code, but has also become a popular venue for a growing number of cybercriminals to carry out continuous cyberattacks. Consequently, several tools have recently been developed to learn representations of GitHub repositories for related applications (e.g., malicious repository detection). However, most of them focus only on code content and ignore the rich relational data among repositories. In addition, they usually require substantial resources to obtain enough labeled data for model training, while ignoring readily available unlabeled data. To this end, we propose a novel model, Rep2Vec, which integrates code content, structural relations, and unlabeled data to learn repository representations. First, to model the repository data comprehensively, we build a repository heterogeneous graph (Rep-HG), which is encoded by a graph neural network. Next, to fully exploit the unlabeled data in Rep-HG, we introduce adversarial attacks to generate more challenging contrastive pairs, which the contrastive learning module uses to train the encoder in the node view and the meta-path view simultaneously. To reduce the encoder's workload under attack, we further design a dual-stream contrastive learning module that combines contrastive learning on the adversarial graph with contrastive learning on the original graph. Finally, the pre-trained encoder is fine-tuned on the downstream task and further enhanced by a knowledge distillation module. Extensive experiments on a dataset collected from GitHub demonstrate the effectiveness of Rep2Vec in comparison with state-of-the-art methods on multiple repository tasks.
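
The dual-stream idea in the abstract (contrasting on the original graph and on an adversarially perturbed graph at the same time) can be illustrated with a minimal sketch. This is not the Rep2Vec implementation: the function names `info_nce` and `dual_stream_loss`, the weight `alpha`, the temperature, and the use of InfoNCE as the contrastive loss are all assumptions made for illustration, and random noise stands in for a real adversarial attack, which would normally be derived from gradients of the loss.

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.5):
    """InfoNCE loss: row i of `anchor` should match row i of `positive`."""
    # L2-normalize so that dot products are cosine similarities.
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))

def dual_stream_loss(z_view1, z_view2, z_adv, alpha=0.5):
    """Weighted sum of a clean-graph stream and an adversarial-graph stream.

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    clean = info_nce(z_view1, z_view2)      # stream on the original graph
    adversarial = info_nce(z_view1, z_adv)  # stream on the perturbed graph
    return alpha * clean + (1.0 - alpha) * adversarial

# Toy embeddings for 8 repositories (assumed to come from some encoder).
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z_view2 = z + 0.05 * rng.normal(size=z.shape)  # mild augmentation
z_adv = z + 0.5 * rng.normal(size=z.shape)     # noise as a stand-in for an attack
loss = dual_stream_loss(z, z_view2, z_adv)
```

In a real training loop, the encoder parameters would be updated to minimize `loss`, and the perturbed view would be regenerated against the current encoder at each step so that the contrastive pairs stay challenging.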

Malicious Repositories Detection with Adversarial Heterogeneous Graph Contrastive Learning

Heterogeneous Graph Neural Networks for Malicious Account Detection

Rep2Vec: Repository Embedding Via Heterogeneous Graph Adversarial Contrastive Learning

Hierarchical Semi-supervised Contrastive Learning for Contamination-Resistant Anomaly Detection

Debiased Graph Contrastive Learning

Adapting Meta Knowledge with Heterogeneous Information Network for COVID-19 Themed Malicious Repository Detection

Homophily-Driven Sanitation View for Robust Graph Contrastive Learning

GCCAD: Graph Contrastive Learning for Anomaly Detection

Detecting Malicious Accounts in Online Developer Communities Using Deep Learning

GCCAD: Graph Contrastive Coding for Anomaly Detection

Similarity Preserving Adversarial Graph Contrastive Learning

A Heterogeneous Graph Learning Model for Cyber-Attack Detection

BotCL: a social bot detection model based on graph contrastive learning

Generative-Enhanced Heterogeneous Graph Contrastive Learning

MalGraph: Hierarchical Graph Neural Networks for Robust Windows Malware Detection

Adversarial Attacks on Code Models with Discriminative Graph Patterns

On the Adversarial Robustness of Graph Contrastive Learning Methods

Hypergraph Contrastive Learning for Drug Trafficking Community Detection

Heterogeneous Graph Contrastive Learning With Augmentation Graph

LAMP: Learnable Meta-Path Guided Adversarial Contrastive Learning for Heterogeneous Graphs

Certifiably Robust Graph Contrastive Learning