kTrans: Knowledge-Aware Transformer for Binary Code Embedding

Wenyu Zhu, Hao Wang, Yuchen Zhou, Jiaming Wang, Zihan Sha, Zeyu Gao, Chao Zhang
2023-08-24
Abstract: Binary Code Embedding (BCE) has important applications in various reverse engineering tasks such as binary code similarity detection, type recovery, control-flow recovery and data-flow analysis. Recent studies have shown that the Transformer model can comprehend the semantics of binary code to support downstream tasks. However, existing models overlooked the prior knowledge of assembly language. In this paper, we propose a novel Transformer-based approach, namely kTrans, to generate knowledge-aware binary code embedding. By feeding explicit knowledge as additional inputs to the Transformer, and fusing implicit knowledge with a novel pre-training task, kTrans provides a new perspective to incorporating domain knowledge into a Transformer framework. We inspect the generated embeddings with outlier detection and visualization, and also apply kTrans to 3 downstream tasks: Binary Code Similarity Detection (BCSD), Function Type Recovery (FTR) and Indirect Call Recognition (ICR). Evaluation results show that kTrans can generate high-quality binary code embeddings, and outperforms state-of-the-art (SOTA) approaches on downstream tasks by 5.2%, 6.8%, and 12.6% respectively. kTrans is publicly available at: https://github.com/Learner0x5a/kTrans-release
Software Engineering, Artificial Intelligence, Cryptography and Security
What problem does this paper attempt to address?
The paper targets three key limitations of existing binary code embedding methods when they are applied to disassembled code:

1. **Lack of utilization of prior knowledge**: Most existing methods treat binary code as natural language and apply natural language models directly to assembly, ignoring instruction set architecture (ISA) knowledge such as opcode types, operand types, and relationships between instructions. For example, a natural language model treats the registers `rax`, `eax`, `ax` and `al` as independent tokens, although `eax`, `ax` and `al` are all sub-registers of `rax`. A model that understands the relationship between `eax` and `al` can better capture the data-flow relationships between instructions.
2. **Lack of understanding of instructions**: Existing methods lack a clear understanding of instruction boundaries and thus cannot model program execution behaviors. For example, PalmTree requires users to provide instruction boundaries, while BinBert has no information about instruction boundaries at all. As a result, a model may be unable to distinguish sequences such as `['pop', 'rbp']` and `['rbp', 'pop']`.
3. **Lack of modeling of implicit dependencies**: Assembly instructions have implicit dependencies, for example through the global flag register EFLAGS. Current methods address these only partly, through manual design. For example, jTrans models instruction jump relationships by sharing the parameters of word embeddings and position embeddings, but does not consider other dependencies. PalmTree models data-dependency relationships by constructing a next sentence prediction (NSP) task on the data-flow graph, but this sacrifices the ability to model the complete assembly-language context.

To overcome these limitations, the paper proposes kTrans, a new Transformer-based binary code embedding method.
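The register example in the first limitation can be made concrete with a small sketch. The alias table and the helper names `parent_register` and `share_parent` below are hypothetical, written only for illustration, and are not taken from the kTrans code base. They show the kind of ISA knowledge that a plain tokenizer discards when it treats `eax` and `al` as unrelated strings.

```python
# Hypothetical alias table (illustration only, not from kTrans):
# each register name maps to (full-width parent register, width in bits).
X86_64_ALIASES = {
    "rax": ("rax", 64),
    "eax": ("rax", 32),
    "ax": ("rax", 16),
    "al": ("rax", 8),
    "rbx": ("rbx", 64),
    "ebx": ("rbx", 32),
    "bx": ("rbx", 16),
    "bl": ("rbx", 8),
}


def parent_register(reg):
    """Return the full-width register that `reg` is part of, or None if unknown."""
    entry = X86_64_ALIASES.get(reg.lower())
    return entry[0] if entry else None


def share_parent(a, b):
    """True when both names are known and refer to parts of the same register."""
    pa, pb = parent_register(a), parent_register(b)
    return pa is not None and pa == pb


# As plain tokens, "eax" and "al" are simply different strings...
print("eax" == "al")               # False
# ...but with ISA knowledge they are parts of the same register.
print(share_parent("eax", "al"))   # True
print(share_parent("eax", "bl"))   # False
```

A write to `al` followed by a read of `eax` is a data dependency that only becomes visible once this relationship is known, which is the motivation the summary gives for injecting prior knowledge.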
kTrans incorporates the prior knowledge of assembly language into the Transformer model by explicitly injecting token knowledge and implicitly injecting instruction knowledge. This improves the quality of the generated binary code embeddings and yields significant improvements on downstream tasks. Specifically, the main contributions of kTrans include:

- Proposing a new method to incorporate the prior knowledge of assembly language into the Transformer model, which can model the implicit dependencies in assembly language.
- Verifying through extensive experiments that the generated embeddings perform well, for example in outlier detection accuracy.
- Discussing future research directions in the field of binary code embedding, including larger domain models, cost-effective models, and combination with general large language models.
- Open-sourcing kTrans to promote future research.
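The idea of feeding explicit token knowledge as additional inputs can be sketched as parallel annotation streams that accompany each token. This is a minimal sketch under stated assumptions, not the kTrans implementation: the vocabularies `OPCODES` and `REGISTERS`, the type labels, and the helpers `token_type` and `annotate` are all hypothetical. Each token is paired with a coarse type and the index of the instruction it belongs to, which also keeps the instruction boundaries discussed in the second limitation.

```python
# Hypothetical vocabularies for illustration; a real system would derive
# these from a disassembler rather than hard-code them.
OPCODES = {"push", "pop", "mov", "add", "ret"}
REGISTERS = {"rax", "eax", "ax", "al", "rbp", "rsp"}


def token_type(tok):
    """Classify a token as opcode, register, immediate, or other."""
    if tok in OPCODES:
        return "opcode"
    if tok in REGISTERS:
        return "register"
    try:
        int(tok, 0)  # accepts decimal and 0x-prefixed hexadecimal literals
        return "immediate"
    except ValueError:
        return "other"


def annotate(instructions):
    """Flatten instructions into (token, token type, instruction index) rows."""
    rows = []
    for idx, ins in enumerate(instructions):
        for tok in ins.replace(",", " ").split():
            rows.append((tok, token_type(tok), idx))
    return rows


rows = annotate(["push rbp", "mov rbp, rsp", "pop rbp"])
for row in rows:
    print(row)
```

In a Transformer, each of these streams would typically be mapped to its own embedding and combined with the token embedding. How kTrans combines them is not stated in this summary, so that step is left out of the sketch.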