From CISC to RISC: language-model guided assembly transpilation

Ahmed Heakl, Chaimaa Abi, Rania Hossam, Abdulrahman Mahmoud
2024-11-25
Abstract: The transition from x86 to ARM architecture is becoming increasingly common across various domains, primarily driven by ARM's energy efficiency and improved performance across traditional sectors. However, this ISA shift poses significant challenges, mainly due to the extensive legacy ecosystem of x86 software and lack of portability across proprietary ecosystems and software stacks. This paper introduces CRT, a lightweight LLM-based transpiler that automatically converts x86 assembly to ARM assembly. Our approach bridges the fundamental architectural gap between x86's CISC-based and ARM's RISC-based computing paradigms while preserving program semantics and optimizing performance. We evaluate CRT on diverse real-world applications, achieving 79.25% translation accuracy from x86 to ARMv5 on our comprehensive test suite, and an 88.68% accuracy from x86 to RISC-V. In practical deployments on Apple M2 hardware (ARMv8), our transpiled code achieves 1.73× speedup compared to Apple's Rosetta 2 virtualization engine, while delivering 2.41× memory efficiency and 1.47× better energy consumption. Through testing and analysis, we show that CRT successfully navigates the CISC/RISC divide and generates correctly executable RISC code despite machine "language" barriers. We release our code, models, training datasets, and benchmarks at: https://ahmedheakl.github.io/asm2asm/
Programming Languages, Hardware Architecture
What problem does this paper attempt to address?
This paper addresses the problem of porting software from the x86 architecture to the ARM architecture. Specifically, it aims to develop a direct translation tool that automatically converts x86 assembly code into ARM assembly code while preserving the program's semantics and optimizing its performance. The main challenges in this conversion are:

1. **Instruction Set Differences**: x86 is a Complex Instruction Set Computer (CISC) architecture, while ARM is a Reduced Instruction Set Computer (RISC) architecture. The two differ significantly in how they process instructions, use registers, and access memory.
2. **Code Compatibility**: A large body of legacy software was written for x86. Running it on ARM requires recompilation or conversion, which involves complex porting work.
3. **Performance and Efficiency**: Existing virtualization solutions (such as QEMU and Apple's Rosetta 2) enable cross-architecture execution, but they introduce significant performance overhead. A more efficient method of code conversion is therefore needed.

To address these challenges, the paper proposes CRT (CISC to RISC Transpiler), a lightweight transpiler based on a large language model (LLM). CRT tackles the problems above as follows:

- **Language Model Training**: A large-scale paired dataset (pairs of x86 and ARM assembly code) is used to train the language model so that it learns the mapping between the two architectures.
- **Syntactic and Functional Correctness**: Testing and evaluation check that the generated ARM code is not only syntactically correct but also functionally consistent with the original x86 code.
- **Performance Optimization**: Model parameters and training strategies are tuned to improve the performance of the converted code in practical applications.

The paper verifies the effectiveness of CRT through experiments on a range of practical applications, notably on Apple M2 hardware, where the code generated by CRT outperforms Apple's Rosetta 2 virtualization engine in speed, memory efficiency, and energy consumption.
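The functional-correctness criterion described above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's evaluation harness: each program is modelled as a hypothetical Python callable that maps an input string to an output string (a stand-in for running a compiled binary), and a translation counts as correct only if it reproduces the reference output on every test input. The names `Program`, `functionally_equivalent`, and `translation_accuracy` are invented for this example.

```python
from typing import Callable, Iterable, Sequence

# Hypothetical stand-in for "run a binary on some stdin and capture stdout".
Program = Callable[[str], str]


def functionally_equivalent(
    reference: Program, candidate: Program, inputs: Iterable[str]
) -> bool:
    """Return True if `candidate` matches `reference` on every input."""
    for text in inputs:
        expected = reference(text)
        try:
            actual = candidate(text)
        except Exception:
            # A crash in the translated program counts as a failed translation.
            return False
        if actual != expected:
            return False
    return True


def translation_accuracy(
    pairs: Sequence[tuple[Program, Program]], inputs: Sequence[str]
) -> float:
    """Percentage of (reference, translated) pairs that are equivalent."""
    if not pairs:
        return 0.0
    passed = sum(
        functionally_equivalent(ref, cand, inputs) for ref, cand in pairs
    )
    return 100.0 * passed / len(pairs)
```

Under this assumed definition, an accuracy figure is the share of test programs whose translated version behaves identically to the original on all supplied inputs. A real harness would replace the callables with assembled and executed binaries.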