Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations

Xiuwei Shang, Li Hu, Shaoyin Cheng, Guoqiang Chen, Benlong Wu, Weiming Zhang, Nenghai Yu
2024-10-24
Abstract: Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification. As IoT devices proliferate and rapidly evolve, their highly heterogeneous hardware architectures and complex compilation settings, coupled with the demand for large-scale function retrieval in practical applications, put forward higher requirements for BCSD methods. In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction, and integrates a pre-trained language model with a graph neural network to capture both semantic and structural information from different perspectives. By introducing momentum contrastive learning, it effectively enhances retrieval capabilities in large-scale candidate function sets, distinguishing between subtle function similarities and differences. Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both One-to-one comparison and One-to-many search scenarios.
Software Engineering
### What problems does this paper attempt to solve?

This paper addresses challenges in binary code similarity detection (BCSD). Specifically, it proposes a new method, IRBinDiff, to address two main issues:

1. **Differences Caused by Complex Compilation Options**:
   - As Internet of Things (IoT) devices proliferate and diversify, they exhibit highly heterogeneous hardware architectures and complex compilation settings. As a result, binaries compiled from the same source code can look completely different.
   - A BCSD method therefore needs to be robust across these diverse compilation environments.
2. **Large-scale Candidate Function Retrieval**:
   - In practical applications, especially large-scale firmware analysis, a small number of similar functions must be retrieved from a large pool of irrelevant ones. This requires the BCSD method to capture subtle semantic differences across large-scale datasets.
   - The paper improves the model's retrieval ability over large candidate function sets by introducing momentum contrastive learning.

### Method Overview

To address these challenges, IRBinDiff adopts the following strategies:

- **Using LLVM Intermediate Representation (LLVM-IR)**: Lift binary files to LLVM-IR, which provides a higher-level semantic abstraction and reduces the differences caused by the underlying hardware and compilation options.
- **Combining Pretrained Language Models and Graph Neural Networks**: Use a pretrained language model to capture semantic information, and encode control-flow graphs with a graph neural network to extract structural information.
- **Introducing Momentum Contrastive Learning**: By maintaining a dynamic queue that holds a large number of negative samples, the model learns from more diverse negatives and more effectively distinguishes semantically similar functions from irrelevant ones.

### Experimental Results

The paper verifies the effectiveness of IRBinDiff through extensive experiments. The results show that IRBinDiff outperforms other existing BCSD methods in both one-to-one comparison and one-to-many search scenarios, especially in cross-compiler, cross-optimization-level, and cross-architecture tasks.

Through these improvements, IRBinDiff improves the accuracy and robustness of binary code similarity detection and is suited to complex practical application scenarios.
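### Illustrative Sketch: Momentum Contrastive Learning

The momentum contrastive learning strategy from the method overview can be illustrated with a minimal Python sketch. It is not IRBinDiff's implementation. The function names (`momentum_update`, `info_nce_loss`, `l2_normalize`), the toy 2-D embeddings, and the queue size are hypothetical, and the momentum and temperature values are common defaults from the momentum-contrast literature rather than settings stated in the summary above. The loss shown is InfoNCE, an objective commonly paired with a negative-sample queue; the summary does not confirm which loss the paper uses.

```python
import math
from collections import deque


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def momentum_update(query_params, key_params, m=0.999):
    """Move key-encoder parameters slowly toward the query encoder: k <- m*k + (1-m)*q."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """Negative log-probability of picking the positive key among all candidates."""
    q = l2_normalize(query)
    candidates = [positive_key] + list(negative_keys)
    logits = [sum(a * b for a, b in zip(q, l2_normalize(c))) / temperature
              for c in candidates]
    top = max(logits)  # subtract the max before exp() for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[0]


# Hypothetical fixed-size queue of negative keys; the oldest entry is evicted first.
queue = deque(maxlen=4)
for key in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.7, 0.7]):
    queue.append(key)

# A query aligned with its positive key gets a much lower loss than a misaligned one.
loss_aligned = info_nce_loss([1.0, 0.0], [0.9, 0.1], queue)
loss_misaligned = info_nce_loss([1.0, 0.0], [-0.9, 0.1], queue)
print(loss_aligned < loss_misaligned)
```

In a training loop built on this idea, the keys produced for each batch would typically be appended to the queue after the loss is computed, so that later batches see them as negatives, while `momentum_update` keeps the key encoder changing slowly enough for queued keys to stay comparable.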