Abstract: Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification. As IoT devices proliferate and rapidly evolve, their highly heterogeneous hardware architectures and complex compilation settings, coupled with the demand for large-scale function retrieval in practical applications, place greater demands on BCSD methods. In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction, and integrates a pre-trained language model with a graph neural network to capture both semantic and structural information from different perspectives. By introducing momentum contrastive learning, it effectively enhances retrieval capabilities in large-scale candidate function sets, distinguishing between subtle function similarities and differences. Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both one-to-one comparison and one-to-many search scenarios.

What problem does this paper attempt to address?

This paper addresses challenges in binary code similarity detection (BCSD). It proposes a new method, IRBinDiff, to tackle two main issues:

1. **Differences Caused by Complex Compilation Options**:
   - As Internet of Things (IoT) devices develop and diversify rapidly, they come with highly heterogeneous hardware architectures and complex compilation settings. As a result, binaries compiled from the same source code can look completely different.
   - A BCSD method therefore needs to be robust across these complex and diverse compilation environments.
2. **Large-scale Candidate Function Retrieval**:
   - In practical applications, especially large-scale firmware analysis, a small number of similar functions must be retrieved from a large pool of irrelevant ones. The BCSD method must therefore capture subtle semantic differences in large-scale datasets.
   - The paper improves the model's retrieval ability over large candidate function sets by introducing momentum contrastive learning.

### Method Overview

To address these challenges, IRBinDiff adopts the following strategies:

- **Using LLVM Intermediate Representation (LLVM-IR)**: Lift binaries to LLVM-IR, which provides a higher-level semantic abstraction and reduces the differences caused by the underlying hardware and compilation options.
- **Combining Pretrained Language Models and Graph Neural Networks**: Use a pretrained language model to capture semantic information, and encode control-flow graphs with a graph neural network to extract structural information.
- **Introducing Momentum Contrastive Learning**: By maintaining a dynamic queue that holds a large number of negative samples, the model learns from more diverse negatives and better distinguishes semantically similar functions from irrelevant ones.

### Experimental Results

The paper evaluates IRBinDiff through extensive experiments under varied compilation settings. The results show that IRBinDiff outperforms existing BCSD methods in both one-to-one comparison and one-to-many search scenarios, especially in cross-compiler, cross-optimization-level, and cross-architecture tasks. Together, these design choices improve the accuracy and robustness of binary code similarity detection and make the method suitable for a range of complex practical scenarios.
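The momentum contrastive learning idea from the method overview can be illustrated with a small, dependency-free sketch. It is not IRBinDiff's implementation: the function names, the use of plain Python lists in place of encoder weights and embeddings, and the defaults `m=0.999` and `temperature=0.07` (taken from the original MoCo formulation) are all assumptions made for illustration. It shows the three ingredients the summary mentions: a fixed-size queue of negative samples, a key encoder updated as a moving average of the query encoder, and a contrastive loss over one positive and all queued negatives.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_negative_queue(size):
    """FIFO queue of negative embeddings; the oldest entry is dropped when full."""
    return deque(maxlen=size)


def momentum_update(query_params, key_params, m=0.999):
    """Move the key encoder's weights slowly toward the query encoder's weights.

    Plain lists of floats stand in for network parameters (an assumption).
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def info_nce_loss(query, positive_key, negative_queue, temperature=0.07):
    """Cross-entropy loss that treats the positive key as the correct class
    among the positive plus every negative currently in the queue."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negative_queue]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    # Hypothetical 2-D embeddings: the same function under two compilation
    # settings (query, positive) and two unrelated functions (negatives).
    queue = make_negative_queue(size=4)
    queue.append([0.0, 1.0])
    queue.append([-1.0, 0.0])
    loss = info_nce_loss([1.0, 0.1], [0.9, 0.0], queue)
    print(f"contrastive loss: {loss:.6f}")
    # After each training step, the keys just encoded would be appended to
    # the queue and the key encoder refreshed with momentum_update().
```

Because the queue is decoupled from the batch, the number of negatives can be much larger than a single batch allows, which is the property the summary credits for better retrieval over large candidate sets.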
