ARCTURUS: Full Coverage Binary Similarity Analysis with Reachability-Guided Emulation

Anshunkang Zhou, Yikun Hu, Xiangzhe Xu, Charles Zhang
DOI: https://doi.org/10.1145/3640337
IF: 3.685
2024-01-11
ACM Transactions on Software Engineering and Methodology
Abstract: Binary code similarity analysis is extremely useful since it provides rich information about an unknown binary, such as revealing its functionality and identifying reused libraries. Robust binary similarity analysis is challenging as heavy compiler optimizations can make semantically similar binaries have gigantic syntactic differences. Unfortunately, existing semantic-based methods still suffer from either incomplete coverage or low accuracy. In this paper, we propose ARCTURUS, a new technique that can achieve high code coverage and high accuracy simultaneously by manipulating program execution under the guidance of code reachability. Our key insight is that the compiler must preserve program semantics (e.g., dependences between code fragments) during compilation; therefore, the code reachability, which implies the interdependence between code, is invariant across code transformations. Based on the above insight, our key idea is to leverage the stability of code reachability to manipulate the program execution such that deep code logic can also be covered in a consistent way. Experimental results show that ARCTURUS achieves an average precision of 87.8% with 100% block coverage, outperforming compared methods by 38.4% on average. ARCTURUS takes only 0.15 seconds to process one function on average, indicating that it is efficient for practical use.
computer science, software engineering
What problem does this paper attempt to address?
This paper addresses two shortcomings of binary code similarity analysis: existing semantic-based methods suffer from either incomplete code coverage or low accuracy. The paper proposes ARCTURUS, a technique that manipulates program execution by exploiting the stability of code reachability, and thereby achieves high coverage and high accuracy at the same time.

### Main Contributions

1. **Dynamic Binary Similarity Analysis Framework**: A new dynamic binary similarity analysis framework that achieves both complete code coverage and accurate results, built on a novel reachability-guided emulation technique.
2. **Formal Proof**: A formal proof that reachability-guided emulation produces the same execution results for semantically equivalent binaries.
3. **Prototype Implementation and Evaluation**: A prototype system named ARCTURUS, evaluated at scale on 820,021 functions from 16 real-world projects.

### Background and Motivation

Binary code similarity analysis matters in many security applications, such as vulnerability discovery, malware detection, patch analysis, forensics, and component analysis. However, compiler optimizations can make semantically similar binaries differ significantly in syntax, so existing methods fall short in coverage, accuracy, or both.

### Technical Details

1. **Reachability-Guided Emulation**: The core idea of ARCTURUS is to use the stability of code reachability to manipulate program execution, so that deep code logic is also covered in a consistent manner. Compilers must preserve program semantics (e.g., dependences between code fragments) during compilation, so code reachability is invariant across code transformations.
2. **Pre-processing and Lifting**: The target binary code is first lifted to the LLVM intermediate representation (IR) and then re-optimized with LLVM's own optimizations, which significantly reduces the number of instructions and improves emulation efficiency.
3. **Emulation Engine**: ARCTURUS captures runtime behaviors (input/output values) and uses them as semantic features for similarity comparison. The emulation engine emulates each function with a predefined set of parameter values and, guided by reachability, dynamically forces the outcomes of certain branches so that all intra-procedural blocks are covered.
4. **Similarity Comparison**: The features extracted from the target code are compared against the reference features in the reference function pool to find the most similar function. The Jaccard inclusion similarity is used to calculate the similarity score between two functions:

\[ S(f_1, f_2)=\frac{\vert f_1\cap f_2\vert}{\vert f_1\cup f_2\vert} \]

where \( f_1 \) and \( f_2 \) are the semantic feature sets of the two functions. The score ranges from 0 to 1, and the closer it is to 1, the more similar the two functions are.

### Experimental Results

ARCTURUS reaches an average function-matching precision of 87.8% with 100% block coverage, outperforming the compared methods by 38.4% on average. Processing one function takes only 0.15 seconds on average, which indicates that the approach is efficient enough for practical use. Extensive case studies further demonstrate its practical applications in known vulnerability detection and binary version identification.
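The idea of forcing branch outcomes to reach every block can be illustrated with a toy sketch. This is not the paper's algorithm: it works on an abstract control-flow graph rather than lifted LLVM IR, it emulates no instructions, and the selection rule (step toward the nearest uncovered block) is an assumption made purely for illustration. The names `distance_to_uncovered` and `guided_runs` are hypothetical.

```python
from collections import deque


def distance_to_uncovered(cfg, start, covered):
    """BFS distance from `start` to the nearest uncovered block.

    Returns None when no uncovered block is reachable from `start`.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        block, dist = queue.popleft()
        if block not in covered:
            return dist
        for succ in cfg.get(block, []):
            if succ not in seen:
                seen.add(succ)
                queue.append((succ, dist + 1))
    return None


def guided_runs(cfg, entry):
    """Repeatedly walk the CFG from `entry`, forcing each branch toward
    the nearest block that has not been covered yet (illustrative rule)."""
    covered = set()
    runs = []
    # Start a new run while some block reachable from the entry is uncovered.
    while distance_to_uncovered(cfg, entry, covered) is not None:
        path = [entry]
        while True:
            block = path[-1]
            covered.add(block)
            # Keep only successors from which uncovered code is still reachable.
            candidates = []
            for succ in cfg.get(block, []):
                dist = distance_to_uncovered(cfg, succ, covered)
                if dist is not None:
                    candidates.append((dist, succ))
            if not candidates:
                break  # nothing new reachable from here: end this run
            # "Force" the branch: take the successor closest to uncovered code.
            path.append(min(candidates)[1])
        runs.append(path)
    return runs, covered


# Hypothetical CFG: a diamond followed by a loop (c <-> d) and an exit block.
cfg = {
    "entry": ["a", "b"],
    "a": ["c"],
    "b": ["c"],
    "c": ["d", "exit"],
    "d": ["c"],
    "exit": [],
}
runs, covered = guided_runs(cfg, "entry")
```

Because every step either covers a new block or moves strictly closer to one, each run terminates even when the graph contains loops, and the runs together cover every block reachable from the entry.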
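The similarity score described in the summary, intersection size over union size of two feature sets, can be sketched in a few lines. The feature values, the function names `jaccard_similarity` and `best_match`, and the treatment of two empty feature sets are assumptions for illustration, not details taken from ARCTURUS.

```python
def jaccard_similarity(features_a, features_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of hashable features."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        # Assumption: two empty feature sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def best_match(target_features, reference_pool):
    """Return (name, score) of the most similar function in the pool.

    `reference_pool` maps a function name to its feature set.
    """
    scored = (
        (name, jaccard_similarity(target_features, feats))
        for name, feats in reference_pool.items()
    )
    return max(scored, key=lambda item: item[1])


# Hypothetical observed input/output values used as semantic features.
target = {("in", 1), ("out", 2), ("out", 7)}
pool = {
    "memcpy_like": {("in", 1), ("out", 2), ("out", 7), ("out", 9)},
    "strlen_like": {("in", 1), ("out", 5)},
}
name, score = best_match(target, pool)
```

Here the target shares three of four features with `memcpy_like` and one of four with `strlen_like`, so `memcpy_like` is selected with a score of 0.75.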