Abstract: The escalating demand to migrate legacy software across different Instruction Set Architectures (ISAs) has driven the development of assembly-to-assembly translators to map between their respective assembly languages. However, the development of these tools requires substantial engineering effort. State-of-the-art approaches use lifting, a technique where source assembly code is translated to an architecture-independent intermediate representation (IR) (for example, the LLVM IR) and use a pre-existing compiler to recompile the IR to the target ISA. However, the hand-written rules these lifters employ are sensitive to the particular compiler and optimization level used to generate the code and require significant engineering effort to support each new ISA. We propose Forklift, the first neural lifter that learns how to translate assembly to LLVM IR using a token-level encoder-decoder Transformer. We show how to incrementally add support for new ISAs by fine-tuning the assembly encoder and freezing the IR decoder, improving the overall accuracy and efficiency. We collect millions of parallel LLVM IR, x86, ARM, and RISC-V programs across compilers and optimization levels to train Forklift and set up an input/output-based accuracy harness. We evaluate Forklift on two challenging benchmark suites and translate 2.5x more x86 programs than a state-of-the-art hand-written lifter and 4.4x more x86 programs than GPT-4 as well as enabling translation from new ISAs.
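
The abstract mentions an input/output-based accuracy harness. The following is a minimal sketch of that idea, not the authors' implementation: a translated program counts as correct only if it produces the same outputs as the reference on every test input. Plain Python callables stand in for compiled programs here, and the names `io_equivalent`, `io_accuracy`, `ref_add`, `good_add`, and `bad_add` are hypothetical.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Assumption: a "program" is modelled as a Python callable. A real harness
# would compile and execute the reference and the lifted code instead.
Program = Callable[..., Any]
Case = Tuple[Program, Program, Sequence[tuple]]


def io_equivalent(reference: Program, candidate: Program,
                  inputs: Sequence[tuple]) -> bool:
    """Return True if candidate matches reference on every input tuple."""
    for args in inputs:
        expected = reference(*args)
        try:
            actual = candidate(*args)
        except Exception:
            # A crashing candidate counts as an incorrect translation.
            return False
        if actual != expected:
            return False
    return True


def io_accuracy(cases: List[Case]) -> float:
    """Fraction of (reference, candidate, inputs) cases that are IO-equivalent."""
    if not cases:
        return 0.0
    passed = sum(io_equivalent(ref, cand, inp) for ref, cand, inp in cases)
    return passed / len(cases)


def ref_add(a, b):
    return a + b


def good_add(a, b):
    return b + a  # behaves identically to ref_add on integers


def bad_add(a, b):
    return a - b  # agrees with ref_add only when b == 0


inputs = [(1, 2), (0, 0), (-3, 5)]
cases = [(ref_add, good_add, inputs), (ref_add, bad_add, inputs)]
print(io_accuracy(cases))  # 0.5
```

Passing such a check shows agreement only on the sampled inputs, so it is evidence of correctness rather than a proof of equivalence.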

What problem does this paper attempt to address?

The paper addresses the growing need to migrate legacy software between different Instruction Set Architectures (ISAs), especially when the software exists only in binary form and the original source code is unavailable. Traditional solutions, such as hand-written binary translators or lifters, can lift the source assembly to a shared compiler intermediate representation (IR). However, these tools require a great deal of engineering effort to support each new ISA, and their hand-written rules are sensitive to the compiler and optimization level that produced the code, which limits their flexibility. To address this, the paper proposes **Forklift**, a neural lifter that uses a Transformer model to learn to generate LLVM IR directly from assembly code. This reduces the manual engineering work required to support new ISAs while improving translation accuracy. Specifically, Forklift addresses these problems in the following ways:

1. **Automated Learned Lifting**: Forklift uses a Transformer-based encoder-decoder model that takes assembly code as input and directly generates the corresponding LLVM IR. This end-to-end approach reduces the dependence on specific compilers and optimization levels, making the model more general.
2. **Incremental Support for New ISAs**: To support a new ISA, Forklift fine-tunes the existing assembly encoder while keeping the LLVM IR decoder frozen. According to the paper, this improves overall accuracy and efficiency when adding ISAs.
3. **Large-scale Datasets**: To train Forklift, the authors collected millions of parallel LLVM IR, x86, ARM, and RISC-V programs, covering multiple compilers and optimization levels. Training on this diverse data is intended to improve the model's generalization.
4. **Performance Evaluation**: The paper evaluates Forklift on two challenging benchmark suites, comparing it with a state-of-the-art hand-written lifter (Lasagne) and a large language model (GPT-4). Forklift translates 2.5x more x86 programs than Lasagne and 4.4x more x86 programs than GPT-4, and it also enables translation from new ISAs.

In summary, Forklift aims to reduce the engineering effort required to migrate legacy software between different ISAs through learned, incrementally extensible lifting, while improving the accuracy and efficiency of translation.
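
The idea of fine-tuning the assembly encoder while freezing the IR decoder can be illustrated with a small, framework-free sketch. It is a toy under stated assumptions, not Forklift's training code: the model is reduced to two named parameter groups, and `ParamGroup` and `sgd_step` are hypothetical names. In a deep learning framework, the same effect is usually achieved by disabling gradient updates for the decoder's parameters.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ParamGroup:
    """A named set of scalar weights plus a flag saying whether to update them."""
    weights: Dict[str, float]
    trainable: bool = True


def sgd_step(groups: Dict[str, ParamGroup],
             grads: Dict[str, Dict[str, float]],
             lr: float) -> None:
    """Apply one SGD update in place, skipping every frozen group."""
    for name, group in groups.items():
        if not group.trainable:
            continue  # frozen parameters keep their current values
        for key, grad in grads[name].items():
            group.weights[key] -= lr * grad


# Toy stand-in for an encoder-decoder model (values are illustrative only).
model = {
    "encoder": ParamGroup({"w": 1.0}),
    "decoder": ParamGroup({"w": 2.0}),
}

# Adding a new ISA: keep the IR decoder fixed and adapt only the encoder.
model["decoder"].trainable = False

grads = {"encoder": {"w": 1.0}, "decoder": {"w": 1.0}}
sgd_step(model, grads, lr=0.5)

print(model["encoder"].weights["w"])  # 0.5 (updated)
print(model["decoder"].weights["w"])  # 2.0 (unchanged)
```

Because the decoder never changes, the IR-generation side of the model stays shared across ISAs, and only the encoder has to learn the new assembly language.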

Forklift: An Extensible Neural Lifter
