Forklift: An Extensible Neural Lifter

Jordi Armengol-Estapé, Rodrigo C. O. Rocha, Jackson Woodruff, Pasquale Minervini, Michael F.P. O'Boyle
2024-04-02
Abstract: The escalating demand to migrate legacy software across different Instruction Set Architectures (ISAs) has driven the development of assembly-to-assembly translators to map between their respective assembly languages. However, the development of these tools requires substantial engineering effort. State-of-the-art approaches use lifting, a technique where source assembly code is translated to an architecture-independent intermediate representation (IR) (for example, the LLVM IR) and use a pre-existing compiler to recompile the IR to the target ISA. However, the hand-written rules these lifters employ are sensitive to the particular compiler and optimization level used to generate the code and require significant engineering effort to support each new ISA. We propose Forklift, the first neural lifter that learns how to translate assembly to LLVM IR using a token-level encoder-decoder Transformer. We show how to incrementally add support to new ISAs by fine-tuning the assembly encoder and freezing the IR decoder, improving the overall accuracy and efficiency. We collect millions of parallel LLVM IR, x86, ARM, and RISC-V programs across compilers and optimization levels to train Forklift and set up an input/output-based accuracy harness. We evaluate Forklift on two challenging benchmark suites and translate 2.5x more x86 programs than a state-of-the-art hand-written lifter and 4.4x more x86 programs than GPT-4 as well as enabling translation from new ISAs.
Programming Languages, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the growing need to migrate legacy software between different Instruction Set Architectures (ISAs), especially when the software exists only in binary form and the original source code is unavailable. Traditional solutions, such as hand-written binary translators or lifters, can lift the source binary to a shared compiler intermediate representation (IR). However, these tools require a great deal of engineering effort to support each new ISA, and their rules depend on the specific compiler and optimization level that produced the code, which limits their flexibility and adaptability.

To solve these problems, the paper proposes **Forklift**, a neural lifter that uses a Transformer model to learn to generate LLVM IR directly from assembly code. This reduces the manual engineering work required to support new ISAs while improving translation accuracy. Specifically, Forklift addresses the problems above in the following ways:

1. **Automated Learning Lifting**: Forklift uses a token-level Transformer encoder-decoder model that takes assembly code as input and directly generates the corresponding LLVM IR. This end-to-end learning method reduces the dependence on specific compilers and optimization levels, making the model more general and flexible.
2. **Incremental Learning and Supporting New ISAs**: Forklift supports a new ISA incrementally, by fine-tuning the existing assembly encoder while freezing the LLVM IR decoder. This maintains accuracy on previously supported ISAs and makes adding a new ISA more efficient.
3. **Large-scale Datasets**: To train Forklift, the authors collected millions of parallel LLVM IR, x86, ARM, and RISC-V programs, covering multiple compilers and optimization levels. Training on such diverse data improves the model's ability to generalize.
4. **Performance Evaluation**: The paper evaluates Forklift with an input/output-based accuracy harness and compares it with a state-of-the-art hand-written lifter (Lasagne) and a large language model (GPT-4). The results show that Forklift performs well on two challenging benchmark suites; on optimized x86 code in particular, it correctly translates 2.5x more programs than Lasagne and 4.4x more than GPT-4.

In summary, Forklift aims to reduce the engineering effort required to migrate legacy software between different ISAs through automated and incremental learning, while improving the accuracy and efficiency of translation.
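
The idea of fine-tuning the encoder while freezing the decoder can be illustrated with a toy update step. The sketch below is not the authors' training code: `Param`, `freeze`, and `sgd_step` are hypothetical names, and the "model" is just two dictionaries of scalar weights. It only shows that frozen parameters are skipped during an update, so the decoder stays identical while the encoder adapts.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """A single scalar weight with a flag saying whether it may be updated."""
    value: float
    trainable: bool = True


def freeze(params: dict) -> None:
    """Mark every parameter in a group as non-trainable (hypothetical helper)."""
    for p in params.values():
        p.trainable = False


def sgd_step(params: dict, grads: dict, lr: float = 0.5) -> None:
    """Plain gradient step that skips frozen parameters."""
    for name, p in params.items():
        if p.trainable:
            p.value -= lr * grads[name]


# Toy stand-in for an encoder-decoder model: one weight per component.
model = {
    "encoder": {"w": Param(1.0)},
    "decoder": {"w": Param(2.0)},
}

# Adding a new ISA: keep the IR decoder fixed and update only the encoder.
freeze(model["decoder"])

grads = {"w": 1.0}  # made-up gradient, for illustration only
for component in model.values():
    sgd_step(component, grads)

encoder_w = model["encoder"]["w"].value  # updated by the step
decoder_w = model["decoder"]["w"].value  # unchanged because it is frozen
```

In a framework such as PyTorch, the same effect is usually obtained by setting `requires_grad = False` on the decoder's parameters before building the optimizer.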
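
An input/output-based accuracy check of the kind mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's harness: the real setup would compile and execute lifted LLVM IR, whereas here plain Python functions stand in for the reference program and the translated candidate, and `io_equivalent` and `io_accuracy` are hypothetical names.

```python
import random
from typing import Callable, Sequence, Tuple


def io_equivalent(reference: Callable[..., int],
                  candidate: Callable[..., int],
                  inputs: Sequence[tuple]) -> bool:
    """Return True if candidate matches reference on every input tuple."""
    for args in inputs:
        expected = reference(*args)
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A candidate that crashes counts as an incorrect translation.
            return False
    return True


def io_accuracy(pairs: Sequence[Tuple[Callable, Callable]],
                inputs: Sequence[tuple]) -> float:
    """Fraction of (reference, candidate) pairs that pass all I/O tests."""
    if not pairs:
        return 0.0
    passed = sum(io_equivalent(ref, cand, inputs) for ref, cand in pairs)
    return passed / len(pairs)


# Stand-ins for an original function and two "translations" of it.
def ref_abs_diff(a: int, b: int) -> int:
    return abs(a - b)


def good_translation(a: int, b: int) -> int:
    return a - b if a > b else b - a


def bad_translation(a: int, b: int) -> int:
    return a - b  # wrong whenever a < b


# Fixed edge cases plus seeded random inputs, so the run is reproducible.
rng = random.Random(0)
inputs = [(0, 0), (5, 3), (3, 5)]
inputs += [(rng.randint(-100, 100), rng.randint(-100, 100)) for _ in range(50)]

pairs = [(ref_abs_diff, good_translation), (ref_abs_diff, bad_translation)]
accuracy = io_accuracy(pairs, inputs)
```

A check like this establishes agreement only on the tested inputs, not full semantic equivalence, so the choice and number of inputs determines how much confidence a pass provides.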