IRaDT: LLVM IR as Target for Efficient Neural Decompilation
Yuzhang Li, Tao Xu, Chunlu Wang
DOI: https://doi.org/10.1142/s0218194024500463
IF: 1.007
2024-10-18
International Journal of Software Engineering and Knowledge Engineering
Abstract: Decompilation is a widely used reverse-engineering technique that aims to restore binary code to human-readable high-level language code. However, the output of traditional decompilers is often hard to read. With advances in language models, several learning-based decompilation methods have emerged. Nevertheless, because language models are probabilistic, the correctness of their output cannot be guaranteed, so engineers must still analyze it to determine what the code does. Inspired by compiler toolchains, we propose a novel approach that makes language models more effective at decompilation by combining traditional rule-based methods with learning-based techniques and drawing on insights from both paradigms. Specifically, we present IRaDT, a pre-trained sequence-to-sequence model tailored to refine decompilation output at the intermediate representation (IR) level. Through this hybridization, we aim to address the limitations of existing methods and achieve more accurate and robust decompilation. We construct a diverse decompilation dataset targeting IR and evaluate IRaDT on it. The experimental results indicate that IRaDT improves the readability of IR while ensuring its compilability, achieving a 74% improvement compared to RetDec and a 93% improvement compared to ChatGPT.
computer science, artificial intelligence, engineering, electrical & electronic, software engineering