Abstract: Binary Code Embedding (BCE) has important applications in various reverse engineering tasks such as binary code similarity detection, type recovery, control-flow recovery and data-flow analysis. Recent studies have shown that the Transformer model can comprehend the semantics of binary code to support downstream tasks. However, existing models overlook the prior knowledge of assembly language. In this paper, we propose a novel Transformer-based approach, namely kTrans, to generate knowledge-aware binary code embedding. By feeding explicit knowledge as additional inputs to the Transformer, and fusing implicit knowledge with a novel pre-training task, kTrans provides a new perspective on incorporating domain knowledge into a Transformer framework. We inspect the generated embeddings with outlier detection and visualization, and also apply kTrans to 3 downstream tasks: Binary Code Similarity Detection (BCSD), Function Type Recovery (FTR) and Indirect Call Recognition (ICR). Evaluation results show that kTrans can generate high-quality binary code embeddings, and outperforms state-of-the-art (SOTA) approaches on downstream tasks by 5.2%, 6.8%, and 12.6% respectively. kTrans is publicly available at: https://github.com/Learner0x5a/kTrans-release

What problem does this paper attempt to address?

The paper targets several key limitations of existing binary code embedding methods when they handle assembly language:

1. **Lack of utilization of prior knowledge**: Most existing methods regard binary code as natural language and directly apply natural language models to assembly language, ignoring the knowledge that the instruction set architecture (ISA) provides, such as instruction opcode types, operand types, and relationships between instructions. For example, a natural language model treats the registers `rax`, `eax`, `ax` and `al` as independent tokens, but in fact `eax`, `ax` and `al` are the lower 32, 16 and 8 bits of `rax`. If the model understands the relationship between `eax` and `al`, it can better capture the data-flow relationships between instructions.
2. **Lack of understanding of instructions**: Existing methods lack a clear understanding of instruction boundaries and thus cannot model program execution behaviors. For example, PalmTree requires users to provide instruction boundaries, while BinBert completely lacks any information about instruction boundaries. As a result, the model may be unable to distinguish sequences such as `['pop', 'rbp']` and `['rbp', 'pop']`.
3. **Lack of modeling of implicit dependencies**: There are implicit dependencies between assembly instructions, for example through the global flag register EFLAGS. Current methods only partly address implicit dependencies, and do so through manual design. For example, jTrans models instruction jump relationships by sharing the parameters of word embeddings and position embeddings, but does not consider other dependencies. PalmTree models data-dependency relationships by constructing a next-sequence-prediction (NSP) task on the data-flow graph, but this sacrifices the ability to model the complete assembly-language context.

To overcome these limitations, the paper proposes kTrans, a new Transformer-based binary code embedding method. kTrans incorporates the prior knowledge of assembly language into the Transformer model by explicitly injecting token knowledge and implicitly injecting instruction knowledge, thereby improving the quality of the generated binary code embeddings and achieving significant improvements in downstream tasks. Specifically, the main contributions of kTrans include:

- Proposing a new method to incorporate the prior knowledge of assembly language into the Transformer model, which can model the implicit dependencies in assembly language.
- Verifying through extensive experiments the superior performance of the generated embeddings on measures such as anomaly detection accuracy.
- Discussing future research directions in the field of binary code embedding, including larger domain models, cost-effective models, and combination with general large language models.
- Open-sourcing kTrans to promote future research.
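The register-aliasing and instruction-boundary points above can be made concrete with a minimal Python sketch. It is an illustration only and is not taken from the kTrans code base: the `REGISTER_ALIASES` table, the `annotate` and `same_register` helpers, and the record fields are all hypothetical names chosen here. The sketch attaches to each token the kind of prior knowledge a plain natural-language tokenizer would discard: the token type, the parent register, and the index of the instruction the token belongs to.

```python
# Hypothetical illustration; not the kTrans implementation.
# Each x86-64 register name maps to (parent 64-bit register, width in bits).
REGISTER_ALIASES = {
    "rax": ("rax", 64), "eax": ("rax", 32), "ax": ("rax", 16), "al": ("rax", 8),
    "rbp": ("rbp", 64), "ebp": ("rbp", 32), "bp": ("rbp", 16),
    "rsp": ("rsp", 64), "esp": ("rsp", 32), "sp": ("rsp", 16),
}


def same_register(a, b):
    """True if both names refer to (parts of) the same 64-bit register."""
    return (
        a in REGISTER_ALIASES
        and b in REGISTER_ALIASES
        and REGISTER_ALIASES[a][0] == REGISTER_ALIASES[b][0]
    )


def annotate(instructions):
    """Return one record per token: text, type, parent register, instruction index."""
    records = []
    for insn_idx, insn in enumerate(instructions):
        opcode, _, operand_str = insn.partition(" ")
        records.append(
            {"token": opcode, "type": "opcode", "parent": None, "insn": insn_idx}
        )
        # Operands are comma-separated; drop empty pieces (e.g. for "ret").
        for operand in filter(None, (o.strip() for o in operand_str.split(","))):
            if operand in REGISTER_ALIASES:
                kind, parent = "register", REGISTER_ALIASES[operand][0]
            else:
                kind, parent = "other", None
            records.append(
                {"token": operand, "type": kind, "parent": parent, "insn": insn_idx}
            )
    return records


if __name__ == "__main__":
    for record in annotate(["mov eax, 1", "add al, 2", "pop rbp"]):
        print(record)
```

With these annotations, `eax` and `al` share the parent `rax`, so a data dependency between the first two instructions is visible, and the `type` and `insn` fields keep the opcode `pop` and its operand `rbp` distinguishable regardless of token order. How kTrans itself encodes such knowledge is described in the paper and its released code, not here.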

kTrans: Knowledge-Aware Transformer for Binary Code Embedding
