GraphMoCo: A Graph Momentum Contrast Model for Large-Scale Binary Function Representation Learning

Runjin Sun, Shize Guo, Jinhong Guo, Li Wei, Xingyu Zhang, Guo Xi, Zhisong Pan
DOI: https://doi.org/10.1016/j.neucom.2024.127273
IF: 6
2024-01-01
Neurocomputing
Abstract: In cybersecurity, computing similarity scores between binary functions is of utmost importance. Because a single binary file may contain a very large number of functions, an effective learning framework must be both accurate and efficient when handling substantial volumes of data. Conventional methods face several limitations. First, accurately labeling pairs of functions is difficult, which makes it hard to apply supervised learning without risk of overtraining. Second, state-of-the-art (SOTA) models often rely on pre-trained encoders or fine-grained graph comparison techniques, which are costly in time and memory. Third, the momentum update algorithm used in graph-based contrastive learning models can cause information leakage. Surprisingly, no existing work addresses this issue. This research focuses on the challenges of large-scale Binary Code Similarity Detection (BCSD). To overcome these problems, we propose GraphMoCo: a graph momentum contrast model that leverages multimodal structural information for efficient binary function representation learning at large scale. We adopt an unsupervised learning strategy, which eliminates the need for manual labeling. By leveraging the intrinsic structural information at multiple levels of the binary code, our model achieves higher accuracy with a simple CNN-based model. A preshuffle mechanism mitigates the information leakage in the graph momentum update algorithm. The evaluation results indicate that GraphMoCo outperforms SOTA approaches in the function pair search task, with an average improvement of 7% on AUC and 10% on MRR and Recall@1.
Furthermore, GraphMoCo achieves a MAP of 0.93 on the more challenging dataset 2, which comprises a larger function pool. In a real-world scenario, known vulnerability search, GraphMoCo achieves an MRR that surpasses existing SOTA models by 5%.
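The abstract reports MRR and Recall@1 for a function search task without defining them. The sketch below shows one common way to compute both metrics from embedding vectors, assuming cosine similarity is used for ranking and that each query has exactly one true match in the pool. The function names, the toy embeddings, and the tie-handling rule are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def rank_of_match(query_emb, pool_emb, true_idx):
    """Return the 1-based rank of the true match for each query.

    Assumption: candidates are ranked by cosine similarity, and ties
    are resolved in favor of the true match.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    sims = q @ p.T  # shape: (n_queries, n_pool)
    true_sims = sims[np.arange(len(true_idx)), true_idx]
    # Rank = 1 + number of pool entries scoring strictly higher than the match.
    return 1 + (sims > true_sims[:, None]).sum(axis=1)

def mrr(ranks):
    """Mean reciprocal rank: average of 1 / rank over all queries."""
    return float(np.mean(1.0 / ranks))

def recall_at_k(ranks, k=1):
    """Fraction of queries whose true match appears in the top k."""
    return float(np.mean(ranks <= k))

# Toy data (hypothetical): 3 pool functions, 2 query functions.
pool = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[0.9, 0.1], [0.2, 1.0]])
true_idx = np.array([0, 2])  # index of each query's true match in the pool

ranks = rank_of_match(queries, pool, true_idx)
print(ranks.tolist())         # [1, 2]
print(mrr(ranks))             # 0.75
print(recall_at_k(ranks, 1))  # 0.5
```

In this toy case the first query retrieves its match at rank 1 and the second at rank 2, so MRR is (1 + 1/2) / 2 = 0.75 and Recall@1 is 0.5. A larger function pool generally makes both metrics harder to score well on, since more candidates can outrank the true match.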