Abstract: Decompilation transforms compiled code back into a high-level programming language for analysis when source code is unavailable. Previous work has primarily focused on enhancing decompilation performance by increasing the scale of model parameters or training data for pre-training. Based on the characteristics of the decompilation task, we propose two methods: (1) Without fine-tuning, the Self-Constructed Context Decompilation (sc$^2$dec) method recompiles the LLM's decompilation results to construct pairs for in-context learning, helping the model improve decompilation performance. (2) Fine-grained Alignment Enhancement (FAE), which meticulously aligns assembly code with source code at the statement level by leveraging debugging information, is employed during the fine-tuning phase to achieve further improvements in decompilation. By integrating these two methods, we achieved a Re-Executability performance improvement of approximately 3.90% on the Decompile-Eval benchmark, establishing a new state-of-the-art performance of 52.41%. The code, data, and models are available at https://github.com/AlongWY/sccdec.

### What problems does this paper attempt to solve?

This paper addresses the challenges of decompilation, in particular converting compiled machine code or bytecode back into a high-level programming language when the source code is unavailable. It proposes two methods to improve the performance of decompilation models:

1. **Self-Constructed Context Decompilation (sc2dec)**
   - **Problem**: The code produced by existing decompilation tools is often hard to read and does not fully reconstruct the structure and details of the original code, such as variable names and the main structures, because this information is lost during compilation.
   - **Solution**: sc2dec recompiles the model's own decompilation results to construct aligned pairs of assembly code and source code for in-context learning. It requires no fine-tuning and improves decompilation performance through the model's in-context learning ability.
2. **Fine-grained Alignment Enhancement (FAE)**
   - **Problem**: Existing methods do not handle fine-grained alignment between assembly code and high-level code during decompilation, which leads to functionally inaccurate output.
   - **Solution**: During fine-tuning, FAE uses debugging information to align assembly code with high-level code at the statement level, further improving decompilation accuracy.

By combining these two methods, the paper improves Re-Executability on the Decompile-Eval benchmark by approximately 3.90%, reaching a new state-of-the-art of 52.41%.
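
The sc2dec idea described above can be pictured as a two-pass loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the prompt layout, and the fallback when recompilation fails are all hypothetical, and the model and compiler are passed in as plain callables so the control flow can be read on its own.

```python
from typing import Callable, Optional, Tuple

def build_prompt(asm: str, example: Optional[Tuple[str, str]] = None) -> str:
    """Format a decompilation prompt (hypothetical layout, not from the paper)."""
    parts = []
    if example is not None:
        example_asm, example_src = example
        # One in-context demonstration: assembly followed by its source.
        parts.append(f"# Assembly:\n{example_asm}\n# Source:\n{example_src}\n")
    parts.append(f"# Assembly:\n{asm}\n# Source:\n")
    return "\n".join(parts)

def self_constructed_decompile(
    asm: str,
    model: Callable[[str], str],
    compile_to_asm: Callable[[str], Optional[str]],
) -> str:
    """Two-pass decompilation with a self-constructed in-context example.

    `model` maps a prompt to source code; `compile_to_asm` maps source code
    to assembly, or returns None if compilation fails. Both are assumed to
    be supplied by the caller.
    """
    # Pass 1: plain decompilation without any example.
    first_guess = model(build_prompt(asm))

    # Recompile the guess. If it compiles, (recompiled asm, guess) is an
    # exactly aligned pair that resembles the target.
    recompiled = compile_to_asm(first_guess)
    if recompiled is None:
        # Assumption: keep the first guess when no pair can be built.
        return first_guess

    # Pass 2: decompile again with the self-constructed pair as context.
    return model(build_prompt(asm, example=(recompiled, first_guess)))
```

In practice `compile_to_asm` would wrap a real compiler and disassembler, and `model` would wrap an LLM call. Both are left abstract here because the summary does not specify them.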
### Summary

The main contributions of the paper are:

- Proposing the Self-Constructed Context Decompilation (sc2dec) method, which exploits the fact that decompilation results can be recompiled to construct a better context.
- Introducing the Fine-grained Alignment Enhancement (FAE) method, which fine-tunes the model on fine-grained alignment data extracted from debugging information, and showing how to synthesize that training data automatically.
- Showing experimentally that the proposed model achieves new state-of-the-art results on the Decompile-Eval benchmark.

These methods improve decompilation performance and offer new ideas and techniques for future research.
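
To make the statement-level alignment behind FAE more concrete, here is a minimal sketch that groups assembly instructions by source line. It assumes the assembly text carries `.loc <file> <line> <column>` directives, which GCC and Clang emit when compiling with debug information. The helper `align_by_loc` and the hand-written, simplified sample are hypothetical, and the paper's actual extraction pipeline may differ.

```python
import re

# Matches ".loc <file-number> <line> ..." and captures the source line number.
LOC_RE = re.compile(r"^\.loc\s+\d+\s+(\d+)")

def align_by_loc(asm_text: str, source_text: str):
    """Return a list of (source statement, [instructions]) pairs."""
    source_lines = source_text.splitlines()
    groups = {}      # source line number -> instructions, in first-seen order
    current = None   # source line the following instructions belong to
    for raw in asm_text.splitlines():
        line = raw.strip()
        match = LOC_RE.match(line)
        if match:
            current = int(match.group(1))
            groups.setdefault(current, [])
            continue
        # Skip blanks, other directives, labels, and code before any .loc.
        if not line or line.startswith(".") or line.endswith(":") or current is None:
            continue
        groups[current].append(line)
    return [
        (source_lines[n - 1].strip(), instrs)
        for n, instrs in groups.items()
        if instrs and 1 <= n <= len(source_lines)
    ]

# Hand-written, simplified sample (not real compiler output).
SAMPLE_SRC = """int add(int a, int b) {
    int c = a + b;
    return c;
}"""

SAMPLE_ASM = """add:
    .loc 1 1 23
    pushq %rbp
    movq %rsp, %rbp
    .loc 1 2 9
    movl %edi, %eax
    addl %esi, %eax
    .loc 1 3 12
    popq %rbp
    ret
"""

pairs = align_by_loc(SAMPLE_ASM, SAMPLE_SRC)
for statement, instructions in pairs:
    print(statement, "<-", instructions)
```

Each resulting pair ties one source statement to the instructions generated for it, which is the kind of fine-grained supervision the summary says is used during fine-tuning.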

Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement

IRaDT: LLVM IR as Target for Efficient Neural Decompilation

WaDec: Decompiling WebAssembly Using Large Language Model

LLM4Decompile: Decompiling Binary Code with Large Language Models

Beyond the C: Retargetable Decompilation using Neural Machine Translation

DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction

Semantics-Recovering Decompilation through Neural Machine Translation

CorDA: Context-Oriented Decomposition Adaptation of Large Language Models

Disassembling Obfuscated Executables with LLM

HexT5: Unified Pre-Training for Stripped Binary Code Information Inference

Decomposition for Enhancing Attention: Improving LLM-based Text-to-SQL through Workflow Paradigm

CorDA: Context-Oriented Decomposition Adaptation of Large Language Models for Task-Aware Parameter-Efficient Fine-tuning

SLaDe: A Portable Small Language Model Decompiler for Optimized Assembly

Improving type information inferred by decompilers with supervised machine learning

Boosting Neural Networks to Decompile Optimized Binaries

StackSight: Unveiling WebAssembly through Large Language Models and Neurosymbolic Chain-of-Thought Decompilation

Demystifying and Assessing Code Understandability in Java Decompilation

AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge

ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis

The Incredible Shrinking Context... in a decompiler near you

SelfCodeAlign: Self-Alignment for Code Generation