Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery

Iftakhar Ahmad, Lannan Luo

2024-04-30

Abstract: Binary code analysis has immense importance in the research domain of software security. Today, software is very often compiled for various Instruction Set Architectures (ISAs). As a result, cross-architecture binary code analysis has become an emerging problem. Recently, deep learning-based binary analysis has shown promising success. It is widely known that training a deep learning model requires a massive amount of data. However, for some low-resource ISAs, an adequate amount of data is hard to find, preventing deep learning from being widely adopted for binary analysis. To overcome the data scarcity problem and facilitate cross-architecture binary code analysis, we propose to apply the ideas and techniques in Neural Machine Translation (NMT) to binary code analysis. Our insight is that a binary, after disassembly, is represented in some assembly language. Given a binary in a low-resource ISA, we translate it to a binary in a high-resource ISA (e.g., x86). Then we can use a model that has been trained on the high-resource ISA to test the translated binary. We have implemented the model called UNSUPERBINTRANS, and conducted experiments to evaluate its performance. Specifically, we conducted two downstream tasks, including code similarity detection and vulnerability discovery. In both tasks, we achieved high accuracies.
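The translate-then-analyze workflow described in the abstract can be sketched as below. This is only a toy illustration, not the UNSUPERBINTRANS implementation: the lookup table `TOY_ARM_TO_X86` is a hypothetical stand-in for the learned translation model, the Jaccard score in `x86_similarity` is a hypothetical stand-in for a model trained on x86, and all function names are assumptions made for this example.

```python
# Toy token table: a hypothetical stand-in for the learned NMT model.
# It is NOT a semantically faithful ARM-to-x86 mapping.
TOY_ARM_TO_X86 = {
    "add": "add",
    "sub": "sub",
    "bl": "call",
    "r0": "eax",
    "r1": "ebx",
}


def translate_to_x86(tokens):
    """Step 1: map low-resource-ISA tokens to x86 tokens (toy lookup)."""
    return [TOY_ARM_TO_X86.get(tok, tok) for tok in tokens]


def x86_similarity(tokens_a, tokens_b):
    """Step 2 stand-in: Jaccard similarity over token sets.

    In the workflow from the abstract, this would be a model that was
    trained only on the high-resource ISA.
    """
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cross_isa_similarity(low_resource_tokens, x86_tokens):
    """Translate first, then reuse the x86-only analysis unchanged."""
    translated = translate_to_x86(low_resource_tokens)
    return x86_similarity(translated, x86_tokens)


if __name__ == "__main__":
    arm_like = ["add", "r0", "r0", "r1"]
    x86_like = ["add", "eax", "ebx"]
    print(cross_isa_similarity(arm_like, x86_like))
```

The point of the design is that the analysis step never sees the low-resource ISA, so no labeled data for that ISA is needed by the downstream model.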

Software Engineering, Cryptography and Security

What problem does this paper attempt to address?

The paper aims to address the issue of cross-architecture binary code analysis in the field of software security, particularly the challenge of applying deep learning due to data scarcity on low-resource Instruction Set Architectures (ISAs). Specifically, the paper proposes an unsupervised binary code translation model named UNSUPERBINTRANS, which translates binary code from low-resource ISAs to high-resource ISAs (such as x86). This allows the use of abundant training data available on high-resource ISAs for binary code analysis. This approach overcomes the data insufficiency problem in low-resource ISAs and facilitates cross-architecture binary code similarity detection and vulnerability discovery tasks.

The main contributions of the paper include:

1. Proposing a novel unsupervised method for translating binary code between different ISAs, ensuring that the translated code retains similar semantics to the original code.
2. Implementing the UNSUPERBINTRANS model and evaluating it on two key binary analysis tasks, namely code similarity detection and vulnerability discovery. The results demonstrate that the model successfully captures code semantics and effectively translates binary code across ISAs.
3. Pioneering a new research direction for binary code analysis on low-resource ISAs, where models are trained on high-resource ISAs and then used to analyze code from other ISAs by translating their binary code to the high-resource ISA, thus addressing the data scarcity issue in low-resource ISAs.

The paper also provides a detailed description of the model design, training process, experimental setup, and results, including the use of BLEU scores to measure translation quality, and the application effectiveness in downstream tasks such as function similarity comparison and vulnerability discovery.
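The summary mentions BLEU scores as the measure of translation quality. As a rough illustration of what such a score computes on assembly token sequences, here is a minimal sentence-level BLEU sketch with a single reference, uniform n-gram weights, and no smoothing. The paper's exact BLEU configuration (n-gram order, smoothing, tokenization) is not given on this page, so those choices, the function names, and the sample instructions are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference.

    Assumption: uniform weights over 1..max_n grams.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        total = sum(cand_counts.values())
        # Modified precision: clip each candidate n-gram count by its
        # count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one zero precision gives 0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Hypothetical token sequences, not taken from the paper.
    reference = "mov eax ebx add eax 2".split()
    translated = "mov eax ebx add eax 1".split()
    print(round(bleu(translated, reference), 4))
```

A score of 1.0 means the translated sequence matches the reference exactly; a single wrong operand already lowers the score because it breaks every n-gram that contains it.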

Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs

Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations

Dynamic Malicious Code Detection Based on Binary Translator

BinDeep: A Deep Learning Approach to Binary Code Similarity Detection

jTrans: Jump-Aware Transformer for Binary Code Similarity Detection

jTrans: Jump-Aware Transformer for Binary Code Similarity

Binary code similarity analysis based on naming function and common vector space

IFAttn: Binary Code Similarity Analysis Based on Interpretable Features with Attention

UniBin: Assembly Semantic-enhanced Binary Vulnerability Detection without Disassembly

How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models

Cross-Language Binary-Source Code Matching with Intermediate Representations

FastBCSD: Fast and Efficient Neural Network for Binary Code Similarity Detection

Semantic aware-based instruction embedding for binary code similarity detection

Similarity-Based Source Code Vulnerability Detection Leveraging Transformer Architecture: Harnessing Cross-Attention for Hierarchical Analysis

Using Document Similarity Methods to create Parallel Datasets for Code Translation

Machine Learning-Based Analysis of Program Binaries: A Comprehensive Study

BEDetector: A Two-Channel Encoding Method to Detect Vulnerabilities Based on Binary Similarity

Improving Binary Code Similarity Transformer Models by Semantics-Driven Instruction Deemphasis

Cyber Vulnerability Intelligence for Internet of Things Binary

Codeformer: A GNN-Nested Transformer Model for Binary Code Similarity Detection