ViC: Virtual Compiler Is All You Need For Assembly Code Search

Zeyu Gao, Hao Wang, Yuanda Wang, Chao Zhang
2024-08-11
Abstract: Assembly code search is vital for reducing the burden on reverse engineers, allowing them to quickly identify specific functions using natural language within vast binary programs. Despite its significance, this critical task is impeded by the complexities involved in building high-quality datasets. This paper explores training a Large Language Model (LLM) to emulate a general compiler. By leveraging Ubuntu packages to compile a dataset of 20 billion tokens, we continue pre-training CodeLlama as a Virtual Compiler (ViC), capable of compiling source code in any language to assembly code. This approach allows for virtual compilation across a wide range of programming languages without the need for a real compiler, preserving semantic equivalency and expanding the possibilities for assembly code dataset construction. Furthermore, we use ViC to construct a sufficiently large dataset for assembly code search. Employing this extensive dataset, we achieve a substantial improvement in assembly code search performance, with our model surpassing the leading baseline by 26%.
Software Engineering, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem of quickly locating specific functions in binary programs, particularly in reverse engineering scenarios. The key points are as follows:

1. **Limitations of the Status Quo**: Reverse engineers typically locate code with a given functionality by looking for unique strings or constants. This approach is inefficient and time-consuming.
2. **Objective**: Develop a method that lets users search binary files for specific functions using natural language descriptions, improving interactivity and search efficiency.
3. **Main Contributions**:
   - **Introduction of the Virtual Compiler (ViC)**: A large language model (LLM) is trained to emulate a general compiler, performing virtual compilation from source code to assembly code. The method applies to C/C++ and can also be extended to other programming languages such as Python and Golang.
   - **Improvement in Assembly Code Search Performance**: The authors construct a high-quality assembly code dataset and train a model that surpasses the leading baseline by 26% on assembly code search.
   - **Resource Sharing**: The models and datasets are released to facilitate future research.
4. **Technical Path**:
   - **Dataset Construction**: A dataset of 20 billion tokens is built by compiling Ubuntu software packages to obtain aligned pairs of source code and assembly code.
   - **Model Training**: Supervised fine-tuning is conducted on the CodeLlama model so that it emulates compiler behavior and generates assembly code.
   - **Encoder Training**: Contrastive learning is used to train an assembly code encoder, further improving the model's ability to search assembly code.
5. **Evaluation and Validation**:
   - **Quality Evaluation of the Virtual Compiler**: The quality of the assembly code generated by the virtual compiler is assessed with several metrics, including sequence similarity, runtime similarity, and semantic similarity.
   - **Case Analysis**: The assembly code generated by the virtual compiler is compared in detail with the output of a real compiler, showing the different types of mismatches.
   - **Validation of Assembly Code Search Capability**: The performance improvement on assembly code search is verified on a real-world evaluation dataset.

In summary, by proposing the concept and method of a virtual compiler, this research improves the efficiency and accuracy of assembly code search and gives reverse engineers a more efficient tool.
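
The contrastive encoder training mentioned under the technical path can be illustrated with a small sketch. The summary does not say which contrastive objective the authors use, so this assumes an InfoNCE-style loss with in-batch negatives, a common choice for training retrieval encoders. The function names, the temperature value of 0.07, and the recall@k metric are illustrative assumptions, not details taken from the paper. Embeddings are plain Python lists here; a real encoder would produce them from query text and assembly code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query_embs, asm_embs, temperature=0.07):
    """Mean InfoNCE loss (assumed objective, not confirmed by the source).

    query_embs[i] is the positive match for asm_embs[i]; every other
    assembly embedding in the batch serves as a negative.
    """
    total = 0.0
    for i, query in enumerate(query_embs):
        logits = [cosine(query, asm) / temperature for asm in asm_embs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(query_embs)


def recall_at_k(query_embs, asm_embs, k=1):
    """Fraction of queries whose matching assembly function ranks in the top k."""
    hits = 0
    for i, query in enumerate(query_embs):
        ranked = sorted(
            range(len(asm_embs)),
            key=lambda j: cosine(query, asm_embs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Toy batch: two queries and two assembly functions with hand-made embeddings.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_asm = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own row
shuffled_asm = [[0.0, 1.0], [1.0, 0.0]]  # positives deliberately misaligned

loss_aligned = info_nce_loss(queries, aligned_asm)
loss_shuffled = info_nce_loss(queries, shuffled_asm)
```

In this toy batch the aligned pairs give a loss near zero and the misaligned pairs give a much larger one, which is the signal that pulls matching query and assembly embeddings together during training. The same similarity scores can then rank candidate functions at search time, as `recall_at_k` does.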