Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement

Yunlong Feng, Dechuan Teng, Yang Xu, Honglin Mu, Xiao Xu, Libo Qin, Qingfu Zhu, Wanxiang Che
2024-10-03
Abstract: Decompilation transforms compiled code back into a high-level programming language for analysis when source code is unavailable. Previous work has primarily focused on enhancing decompilation performance by increasing the scale of model parameters or training data for pre-training. Based on the characteristics of the decompilation task, we propose two methods: (1) Without fine-tuning, the Self-Constructed Context Decompilation (sc$^2$dec) method recompiles the LLM's decompilation results to construct pairs for in-context learning, helping the model improve decompilation performance. (2) Fine-grained Alignment Enhancement (FAE), which meticulously aligns assembly code with source code at the statement level by leveraging debugging information, is employed during the fine-tuning phase to achieve further improvements in decompilation. By integrating these two methods, we achieved a Re-Executability performance improvement of approximately 3.90% on the Decompile-Eval benchmark, establishing a new state-of-the-art performance of 52.41%. The code, data, and models are available at https://github.com/AlongWY/sccdec.
Software Engineering, Computation and Language
### What problems does this paper attempt to solve?

This paper addresses challenges in decompilation, in particular converting compiled machine code or bytecode back into a high-level programming language when the source code is unavailable. It proposes two methods to improve the performance of decompilation models:

1. **Self-Constructed Context Decompilation (sc$^2$dec)**
   - **Problem**: Code generated by existing decompilation tools is often hard to read and does not fully reconstruct the structure and details of the original code (such as variable names and main structures), because this information is lost during compilation.
   - **Solution**: sc$^2$dec recompiles the model's own decompilation output to build aligned pairs (assembly code paired with source code) that serve as examples for in-context learning. The method requires no fine-tuning; it improves decompilation by exploiting the model's in-context learning ability.
2. **Fine-grained Alignment Enhancement (FAE)**
   - **Problem**: Existing methods do not model the fine-grained alignment between assembly code and high-level code, which leads to generated code that is functionally inaccurate.
   - **Solution**: FAE is applied during fine-tuning. It uses debugging information to align assembly code with high-level code at the statement level, further improving decompilation accuracy.

Combining the two methods improves re-executability on the Decompile-Eval benchmark by approximately 3.90%, reaching a new state-of-the-art re-executability of 52.41%.
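The sc$^2$dec idea described above can be sketched as a two-pass loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_prompt`, `sc2dec`, the prompt layout, and the two callables `decompile` (a wrapper around an LLM) and `compile_to_asm` (a wrapper around a compiler that returns `None` on failure) are all hypothetical names introduced here.

```python
from typing import Callable, Optional, Tuple


def build_prompt(asm: str, example: Optional[Tuple[str, str]] = None) -> str:
    """Format a decompilation prompt, optionally preceded by one
    (assembly, source) pair used as an in-context example.
    The prompt layout is an assumption made for illustration."""
    parts = []
    if example is not None:
        example_asm, example_src = example
        parts.append(f"# Assembly:\n{example_asm}\n# Source:\n{example_src}\n")
    parts.append(f"# Assembly:\n{asm}\n# Source:\n")
    return "\n".join(parts)


def sc2dec(
    asm: str,
    decompile: Callable[[str], str],
    compile_to_asm: Callable[[str], Optional[str]],
) -> str:
    """Decompile `asm`, then retry with a self-constructed example."""
    # Pass 1: decompile without any in-context example.
    draft = decompile(build_prompt(asm))

    # Recompile the draft; None signals that it did not compile.
    draft_asm = compile_to_asm(draft)
    if draft_asm is None:
        return draft  # no pair can be built, so keep the first answer

    # Pass 2: the (recompiled assembly, draft source) pair is a correctly
    # aligned example by construction, so it is placed before the query.
    return decompile(build_prompt(asm, example=(draft_asm, draft)))
```

In practice `decompile` would call a model and `compile_to_asm` would invoke a real compiler and disassembler; both are injected here so the control flow can be exercised with simple stand-ins.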
### Summary

The main contributions of the paper are:

- Proposing the Self-Constructed Context Decompilation (sc$^2$dec) method, which exploits the fact that decompilation results can be recompiled to construct better in-context examples.
- Introducing the Fine-grained Alignment Enhancement (FAE) method, which fine-tunes the model on fine-grained alignment data extracted from debugging information, and showing how to synthesize this training data automatically.
- Showing experimentally that the proposed model achieves new state-of-the-art results on the Decompile-Eval benchmark.

Beyond improving decompilation performance, these methods offer new directions for future research.
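To illustrate what statement-level alignment from debugging information can look like, the sketch below groups assembly instructions by the source line named in the preceding `.loc` directive, which GNU-style assembly output contains when code is compiled with debug information (for example `gcc -g -S`). The data format actually used for FAE is not described in this summary, so `align_statements` and its output shape are assumptions made for illustration only.

```python
import re
from typing import Dict, List, Tuple

# Matches ".loc <file-number> <line-number> ..." and captures the line number.
LOC_RE = re.compile(r"^\s*\.loc\s+\d+\s+(\d+)")


def align_statements(asm_text: str, source: str) -> List[Tuple[int, str, List[str]]]:
    """Return (line number, source statement, instructions) triples."""
    src_lines = source.splitlines()
    groups: Dict[int, List[str]] = {}
    current = None  # source line announced by the most recent .loc

    for raw in asm_text.splitlines():
        line = raw.strip()
        match = LOC_RE.match(line)
        if match:
            current = int(match.group(1))
            groups.setdefault(current, [])
            continue
        # Skip blank lines, other assembler directives, and labels.
        if not line or line.startswith(".") or line.endswith(":"):
            continue
        if current is not None:
            groups[current].append(line)

    # Keep only lines that exist in the source and received instructions.
    return [
        (n, src_lines[n - 1].strip(), instrs)
        for n, instrs in sorted(groups.items())
        if instrs and 1 <= n <= len(src_lines)
    ]


if __name__ == "__main__":
    source = "int add(int a, int b) {\n    int c = a + b;\n    return c;\n}"
    asm = """add:
    .loc 1 1 23
    pushq %rbp
    movq %rsp, %rbp
    .loc 1 2 9
    movl %edi, %eax
    addl %esi, %eax
    .loc 1 4 1
    popq %rbp
    ret
"""
    for line_no, stmt, instrs in align_statements(asm, source):
        print(line_no, stmt, instrs)
```

Triples like these could then be interleaved (source statement followed by its instructions) to form fine-tuning examples; how the paper formats them is left open here.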