Abstract: A wide range of binary analysis applications, such as bug discovery, malware analysis, and code clone detection, require recovering the contextual meaning of binary code. Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary instead of manually crafting the specifics of the analysis algorithm. However, existing machine-learning approaches are still specialized to a single problem domain, so models must be recreated for each type of binary analysis. In this paper, we propose DeepSemantic, which uses BERT to produce a semantic-aware code representation of binary code. To this end, we introduce well-balanced instruction normalization that retains rich information for each instruction while minimizing the out-of-vocabulary (OOV) problem. DeepSemantic has been carefully designed based on our study of large swaths of binaries. In addition, DeepSemantic leverages the BERT architecture to re-purpose a pre-trained generic model, which is produced in a one-time process and is then readily available, and quickly adapts it to specific downstream tasks through fine-tuning. We demonstrate DeepSemantic with two downstream tasks, namely binary similarity comparison and compiler provenance (i.e., compiler and optimization level) prediction. Our experimental results show that the binary similarity model outperforms two state-of-the-art binary similarity tools, DeepBinDiff and SAFE, by 49.84% and 15.83% on average, respectively.

What problem does this paper attempt to address?

This paper addresses the challenge of automatically recovering contextual meaning in binary code analysis. Specifically, it focuses on using machine-learning techniques, in particular a BERT-based deep-learning model, to generate representations that reflect the semantics of binary code. Existing methods solve problems in specific domains, but they usually require a new model for each binary analysis task, which is time-consuming and inefficient. They also often fail to preserve the context of the code when processing a binary, which leads to inaccurate analysis results. To overcome these challenges, the paper proposes a new framework named DeepSemantic. The main contributions of DeepSemantic are as follows:

1. **Semantic-aware binary code representation**: By introducing a well-balanced instruction normalization method, DeepSemantic preserves the rich information of each instruction while minimizing the out-of-vocabulary (OOV) problem. This lets the model capture the semantic information in binary code more accurately.
2. **Two-stage training model**: DeepSemantic adopts a two-stage training strategy. First, a general model (DS-Pre) is produced through pre-training; then this pre-trained model is fine-tuned for a specific downstream task (DS-Task). This design makes the model more flexible and reduces the computational resources needed compared with retraining a model for every task.
3. **Application in downstream tasks**: The paper applies DeepSemantic to two specific downstream tasks: binary similarity comparison (DS-BinSim) and compiler provenance prediction (DS-Toolchain).
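To make the link between instruction normalization and the OOV problem concrete, here is a minimal Python sketch. The normalization rules, the token names (`imm`, `_addr`), and the helper names are all hypothetical and chosen only for illustration; this summary does not spell out DeepSemantic's actual rules. The sketch shows only the general idea: collapsing values that vary from binary to binary (immediates, branch targets) into shared tokens shrinks the vocabulary, so fewer instructions in unseen binaries fall outside it.

```python
import re

# Hypothetical rule set: any standalone hex or decimal literal is treated as
# a value that varies across binaries. These are NOT the paper's rules.
_NUMBER = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+)\b")


def normalize_instruction(insn: str) -> str:
    """Map one x86-style instruction string to a single normalized token."""
    insn = insn.strip().lower()
    mnemonic, _, operands = insn.partition(" ")
    operands = operands.strip()

    # A direct call/jump target is an address that carries no meaning
    # across binaries, so it collapses to one placeholder.
    if mnemonic in {"call", "jmp"} and _NUMBER.fullmatch(operands):
        return f"{mnemonic}_addr"

    # Keep registers and memory-operand shape, but replace numeric literals.
    ops = [
        _NUMBER.sub("imm", op.strip().replace(" ", ""))
        for op in operands.split(",")
        if op.strip()
    ]
    return "_".join([mnemonic] + ops)


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never appear in the training vocabulary."""
    if not test_tokens:
        return 0.0
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy corpora (made-up instructions, not data from the paper).
train = ["mov eax, 0x10", "call 0x401000"]
test = ["mov eax, 0x20", "call 0x402000"]

raw_rate = oov_rate(train, test)
norm_rate = oov_rate(
    [normalize_instruction(i) for i in train],
    [normalize_instruction(i) for i in test],
)
print(raw_rate, norm_rate)
```

On this toy input every raw test instruction is out of vocabulary, while every normalized one is covered. The trade-off the summary calls "well-balanced" is visible here too: normalizing more aggressively (for example, also collapsing register names) would lower the OOV rate further but discard information the model could use.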
The experimental results show that DS-BinSim significantly outperforms two existing state-of-the-art tools, DeepBinDiff and SAFE, in binary similarity comparison, by 49.84% and 15.83% on average, respectively. In the compiler provenance prediction task, DS-Toolchain achieves high F1 scores of 0.96 and 0.91. In conclusion, the paper aims to improve the accuracy and efficiency of binary code analysis by proposing a new deep-learning framework, DeepSemantic, to better support application scenarios such as vulnerability discovery, malware analysis, and code clone detection.
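As a rough illustration of what binary similarity comparison over learned representations can look like, the following Python sketch ranks candidate functions by the cosine similarity of their embedding vectors. Cosine similarity is an assumption made here for illustration; this summary does not state which similarity measure DS-BinSim uses. The vectors and function names are made up, and in practice the embeddings would come from a fine-tuned model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(query, candidates[name]),
        reverse=True,
    )


# Hypothetical embeddings standing in for model output.
query_embedding = [1.0, 0.0, 1.0]
candidate_embeddings = {
    "strlen_O0": [0.0, 1.0, 0.0],
    "memcpy_O2": [0.9, 0.1, 1.1],
}

ranking = rank_candidates(query_embedding, candidate_embeddings)
print(ranking)
```

With these toy vectors, `memcpy_O2` ranks first because its embedding points in nearly the same direction as the query, while `strlen_O0` is orthogonal to it.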

Semantic-aware Binary Code Representation with BERT
