Semantic-aware Binary Code Representation with BERT

Hyungjoon Koo, Soyeon Park, Daejin Choi, Taesoo Kim
DOI: https://doi.org/10.48550/arXiv.2106.05478
2021-06-10
Abstract: A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovery of contextual meanings on a binary code. Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary instead of manually crafting specifics of the analysis algorithm. However, the existing approaches utilizing machine learning are still specialized to solve one domain of problems, rendering recreation of models for different types of binary analysis. In this paper, we propose DeepSemantic utilizing BERT in producing the semantic-aware code representation of a binary code. To this end, we introduce well-balanced instruction normalization that holds rich information for each of instructions yet minimizing an out-of-vocabulary (OOV) problem. DeepSemantic has been carefully designed based on our study with large swaths of binaries. Besides, DeepSemantic leverages the essence of the BERT architecture into re-purposing a pre-trained generic model that is readily available as a one-time processing, followed by quickly applying specific downstream tasks with a fine-tuning process. We demonstrate DeepSemantic with two downstream tasks, namely, binary similarity comparison and compiler provenance (i.e., compiler and optimization level) prediction. Our experimental results show that the binary similarity model outperforms two state-of-the-art binary similarity tools, DeepBinDiff and SAFE, 49.84% and 15.83% on average, respectively.
Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses the challenge of automatically recovering contextual meaning in binary code analysis. Specifically, it focuses on using machine-learning techniques, in particular a BERT-based deep-learning model, to generate representations that reflect the semantics of binary code. Existing machine-learning approaches solve problems in specific domains, but a new model usually has to be built for each binary analysis task, which is time-consuming and inefficient. In addition, existing methods often fail to fully preserve the contextual information of the code when processing a binary, which leads to inaccurate analysis results. To overcome these challenges, the paper proposes a new framework named DeepSemantic. Its main contributions are as follows:

1. **Semantic-aware binary code representation**: By introducing a well-balanced instruction normalization method, DeepSemantic preserves rich information for each instruction while minimizing the out-of-vocabulary (OOV) problem. This enables the model to capture the semantic information in binary code more accurately.
2. **Two-stage training model**: DeepSemantic adopts a two-stage training strategy. First, a generic model (DS-Pre) is produced through pre-training; this pre-trained model is then fine-tuned for a specific downstream task (DS-Task). This design makes the model more flexible and reduces the computational resources required for retraining.
3. **Application in downstream tasks**: The paper demonstrates DeepSemantic on two specific downstream tasks: binary similarity comparison (DS-BinSim) and compiler provenance prediction (DS-Toolchain).
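
To make the idea of instruction normalization more concrete, the following is a minimal Python sketch. It is not the paper's rule set, which this summary does not describe; the token names `immval` and `disp`, the Intel-syntax input, and both helper functions are assumptions chosen only for illustration. The sketch keeps mnemonics and register names (the "rich information") and collapses numeric constants into placeholder tokens, so instructions that differ only in a constant share one vocabulary entry.

```python
import re

# Hypothetical normalization rules for illustration only; the actual
# rules used by DeepSemantic are not given in this summary.
NUMERIC = re.compile(r"^-?(0x[0-9a-f]+|\d+)$")
DISPLACEMENT = re.compile(r"(?<=[+-])(0x[0-9a-f]+|\d+)")


def normalize_operand(op: str) -> str:
    """Map one Intel-syntax operand to a normalized token."""
    op = op.strip().lower()
    if NUMERIC.match(op):
        # Any bare constant becomes a single placeholder token.
        return "immval"
    if op.startswith("[") and op.endswith("]"):
        # Memory operand: keep registers, replace the numeric displacement.
        inner = op[1:-1].replace(" ", "")
        return "[" + DISPLACEMENT.sub("disp", inner) + "]"
    # Registers and anything else are kept as-is.
    return op


def normalize_instruction(ins: str) -> str:
    """Join the mnemonic and normalized operands into one vocabulary token."""
    mnemonic, _, rest = ins.strip().partition(" ")
    operands = [normalize_operand(o) for o in rest.split(",")] if rest.strip() else []
    return "_".join([mnemonic.lower()] + operands)


if __name__ == "__main__":
    for ins in ["mov eax, 0x10", "mov eax, 0x20", "mov rax, [rbp-0x18]", "ret"]:
        print(ins, "->", normalize_instruction(ins))
```

In this sketch, `mov eax, 0x10` and `mov eax, 0x20` both map to the same token, which is how a normalization step can shrink the vocabulary and reduce OOV tokens. How aggressively to normalize is a trade-off: collapsing more operands loses information, while collapsing fewer enlarges the vocabulary.
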
The experimental results show that DS-BinSim outperforms two state-of-the-art binary similarity tools, DeepBinDiff and SAFE, by 49.84% and 15.83% on average, respectively. In the compiler provenance prediction task, DS-Toolchain achieves high F1 scores of 0.96 and 0.91. In conclusion, the paper aims to improve the accuracy and efficiency of binary code analysis with a new deep-learning framework, DeepSemantic, in order to better support applications such as vulnerability discovery, malware analysis, and code clone detection.
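
For readers unfamiliar with the F1 metric reported above, the sketch below computes a per-class F1 score, $F_1 = \frac{2\,TP}{2\,TP + FP + FN}$, from true and predicted labels. The labels `gcc` and `clang` and the toy predictions are made up for illustration; they are not data from the paper, and the function is not the authors' evaluation code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    # Toy example with hypothetical compiler labels.
    y_true = ["gcc", "gcc", "clang", "clang"]
    y_pred = ["gcc", "clang", "clang", "clang"]
    print(per_class_f1(y_true, y_pred))
```

Here one `gcc` sample is misclassified as `clang`, so `gcc` scores 2/3 and `clang` scores 0.8. A single headline F1 for a multi-class task is typically an average of such per-class values.
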