PDG2Vec: Identify the Binary Function Similarity with Program Dependence Graph

Yuntao Zhang,Yanhao Wang,Yuwei Liu,Zhengyuan Pang,Binxing Fang
DOI: https://doi.org/10.1109/qrs57517.2022.00061
2022-01-01
Abstract:Binary code similarity identification is an important technique applied to many security applications (e.g., plagiarism detection, bug search). The primary challenge of this research topic is how to extract sufficient information from the binary code for similarity comparison. Although numerous approaches have been proposed to address the challenge, most of them leverage features determined by human experience or extracted using machine learning methods and ignore some critical technique semantic information. Additionally, they assess their approach exclusively in laboratory environments and lack real-world datasets. Both problems lead to the limited effectiveness of these methods in real application scenarios (e.g., vulnerable function search).In this paper, we propose a novel approach PDG2Vec, which extracts the data dependence graph and control dependence graph (i.e., program dependence graph (PDG)) as the features of functions and uses them for identifying function similarity. Meanwhile, we design several strategies to optimize the PDG’s construction and use them in similarity comparison to balance time-consuming and accuracy. We implement the prototype of PDG2Vec, which can perform binary code similarity comparison across architectures of x86, x86_64, MIPS32, ARM32, and ARM64. We evaluate PDG2Vec with two datasets. The experimental results show that PDG2Vec is resilient to cross-architecture and extracts more precise semantics than other approaches. Moreover, PDG2Vec outperforms the state-of-the-art tools in the vulnerable function search scenario and has excellent performance.
What problem does this paper attempt to address?