Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery

Iftakhar Ahmad, Lannan Luo
2024-04-30
Abstract: Binary code analysis has immense importance in the research domain of software security. Today, software is very often compiled for various Instruction Set Architectures (ISAs). As a result, cross-architecture binary code analysis has become an emerging problem. Recently, deep learning-based binary analysis has shown promising success. It is widely known that training a deep learning model requires a massive amount of data. However, for some low-resource ISAs, an adequate amount of data is hard to find, preventing deep learning from being widely adopted for binary analysis. To overcome the data scarcity problem and facilitate cross-architecture binary code analysis, we propose to apply the ideas and techniques in Neural Machine Translation (NMT) to binary code analysis. Our insight is that a binary, after disassembly, is represented in some assembly language. Given a binary in a low-resource ISA, we translate it to a binary in a high-resource ISA (e.g., x86). Then we can use a model that has been trained on the high-resource ISA to test the translated binary. We have implemented the model called UNSUPERBINTRANS, and conducted experiments to evaluate its performance. Specifically, we conducted two downstream tasks, including code similarity detection and vulnerability discovery. In both tasks, we achieved high accuracies.
Software Engineering, Cryptography and Security
What problem does this paper attempt to address?
The paper addresses cross-architecture binary code analysis in software security, in particular the difficulty of applying deep learning to low-resource Instruction Set Architectures (ISAs), for which training data is scarce. It proposes an unsupervised binary code translation model named UNSUPERBINTRANS, which translates binary code from a low-resource ISA to a high-resource ISA (such as x86). The abundant training data available for the high-resource ISA can then be used for binary code analysis. This approach works around the lack of data for low-resource ISAs and supports cross-architecture binary code similarity detection and vulnerability discovery.

The main contributions of the paper include:

1. Proposing a novel unsupervised method for translating binary code between different ISAs, ensuring that the translated code retains similar semantics to the original code.
2. Implementing the UNSUPERBINTRANS model and evaluating it on two key binary analysis tasks, namely code similarity detection and vulnerability discovery. The results demonstrate that the model successfully captures code semantics and effectively translates binary code across ISAs.
3. Pioneering a new research direction for binary code analysis on low-resource ISAs: models are trained on a high-resource ISA and then used to analyze code from other ISAs by translating that code to the high-resource ISA, which addresses the data scarcity issue.

The paper also describes the model design, training process, experimental setup, and results in detail. Translation quality is measured with BLEU scores, and effectiveness is measured on the downstream tasks of function similarity comparison and vulnerability discovery.
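
To make the BLEU-based translation-quality measurement more concrete, the following is a minimal sketch of sentence-level BLEU over assembly tokens. It is not the authors' evaluation code: this summary does not state the n-gram order, smoothing, or tokenization the paper uses, so the sketch assumes the common default of uniform weights over 1- to 4-grams, a single reference, and no smoothing. The function names `ngrams` and `bleu` and the sample instruction tokens are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU with one reference (assumed setup)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            # Without smoothing, any zero precision makes the score zero.
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example: a translated x86 token sequence vs. a reference one.
translated = ["mov", "eax", "ebx", "add", "eax", "1"]
reference = ["mov", "eax", "ebx", "add", "eax", "2"]
score = bleu(translated, reference)
```

A score of 1.0 means the translated token sequence matches the reference exactly, and lower scores indicate fewer shared n-grams. BLEU only measures surface overlap, which is consistent with the paper also evaluating on downstream tasks rather than relying on BLEU alone.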