Abstract: Assembly code search is vital for reducing the burden on reverse engineers, allowing them to quickly identify specific functions using natural language within vast binary programs. Despite its significance, this critical task is impeded by the complexities involved in building high-quality datasets. This paper explores training a Large Language Model (LLM) to emulate a general compiler. By compiling Ubuntu packages into a dataset of 20 billion tokens, we continue pre-training CodeLlama as a Virtual Compiler (ViC), capable of compiling source code in any language to assembly code. This approach allows for virtual compilation across a wide range of programming languages without the need for a real compiler, preserving semantic equivalency and expanding the possibilities for assembly code dataset construction. Furthermore, we use ViC to construct a sufficiently large dataset for assembly code search. Employing this extensive dataset, we achieve a substantial improvement in assembly code search performance, with our model surpassing the leading baseline by 26%.
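
To make the search task concrete, here is a minimal sketch of how a natural language query could be matched against assembly functions. It assumes an embedding-based setup in which the query and each function have already been encoded as vectors; the abstract does not specify the retrieval mechanism, and the function names, vectors, and the `search` helper below are hypothetical, invented purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_vec, function_vecs, top_k=2):
    """Rank assembly functions by similarity to the query embedding."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in function_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional embeddings, invented for illustration only.
# A real system would obtain these from trained text and assembly encoders.
FUNCTION_VECS = {
    "sub_401000": [0.9, 0.1, 0.0],
    "sub_401250": [0.1, 0.8, 0.3],
    "sub_4017f0": [0.0, 0.2, 0.9],
}
# Stands in for an encoded natural language query (hypothetical values).
QUERY_VEC = [0.8, 0.2, 0.1]

results = search(QUERY_VEC, FUNCTION_VECS)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Under this assumption, the quality of the search depends entirely on how well the encoders place matching queries and functions close together, which is why the size and quality of the training dataset matter.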

What problem does this paper attempt to address?

The paper addresses the problem of quickly locating specific functions in binary programs, especially in reverse engineering scenarios. The key points are:

1. **Challenges of the Status Quo**: Reverse engineers typically locate code with a specific function by hunting for unique strings or constants. This method is inefficient and time-consuming.
2. **Objective**: Develop a method that lets users search binary files for specific functions through natural language descriptions, improving interactivity and search efficiency.
3. **Main Contributions**:
   - **Introduction of the Virtual Compiler (ViC)**: A large language model (LLM) is trained to emulate a general compiler, performing virtual compilation from source code to assembly code. The method applies to C/C++ and can be extended to other programming languages such as Python and Golang.
   - **Improvement in Assembly Code Search Performance**: The authors construct a high-quality assembly code dataset and train a model that outperforms the leading baseline by 26% on assembly code search.
   - **Resource Sharing**: The models and datasets are released to support future research.
4. **Technical Path**:
   - **Dataset Construction**: A dataset of 20 billion tokens is built by compiling Ubuntu software packages to obtain aligned source code and assembly code.
   - **Model Training**: CodeLlama is further trained through continued pre-training to emulate compiler behavior and generate assembly code.
   - **Encoder Training**: An assembly code encoder is trained with contrastive learning to improve the model's ability to search assembly code.
5. **Evaluation and Validation**:
   - **Quality Evaluation of the Virtual Compiler**: The assembly code generated by the virtual compiler is assessed with several metrics, including sequence similarity, runtime similarity, and semantic similarity.
   - **Case Analysis**: The assembly code generated by the virtual compiler is compared in detail with the output of a real compiler, showing the different types of mismatches.
   - **Validation of Assembly Code Search Capability**: The performance improvement on assembly code search is verified on an evaluation dataset built from real-world code.

In summary, this research improves the efficiency and accuracy of assembly code search by proposing the concept and method of a virtual compiler, giving reverse engineers a more effective tool.
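
The contrastive encoder training mentioned above can be illustrated with a short sketch. The summary does not state which contrastive objective the authors use, so the InfoNCE loss with in-batch negatives below is an assumption, chosen because it is a common choice for this kind of training. The `info_nce` function, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(query_vecs, asm_vecs, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed objective, not confirmed by the source).

    query_vecs[i] and asm_vecs[i] form a positive pair; every other
    assembly vector in the batch acts as a negative for query i.
    Vectors are assumed to be unit-normalized.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, a) / temperature for a in asm_vecs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true partner.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy unit vectors, invented for illustration only.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_asm = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own partner
shuffled_asm = [[0.0, 1.0], [1.0, 0.0]]  # partners swapped

loss_aligned = info_nce(queries, aligned_asm)
loss_shuffled = info_nce(queries, shuffled_asm)
print(loss_aligned < loss_shuffled)
```

The loss is small when each description embedding sits closest to its own assembly function and large when it sits closer to another function in the batch, which is the behavior a search encoder needs.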

ViC: Virtual Compiler Is All You Need For Assembly Code Search
