Program Decomposition and Translation with Static Analysis

Ali Reza Ibrahimzada
DOI: https://doi.org/10.1145/3639478.3641226
2024-01-23
Abstract: The rising popularity of Large Language Models (LLMs) has motivated exploring their use in code-related tasks. Code LLMs with millions of parameters or more are trained on massive amounts of code in different Programming Languages (PLs). Such models are used to automate various Software Engineering (SE) tasks through prompt engineering. However, given the very large size of industry-scale project files, a major issue of these LLMs is their limited context window size, motivating the question: "Can these LLMs process very large files, and can we effectively perform prompt engineering?" Code translation aims to convert source code from one PL to another. In this work, we assess the effect of method-level program decomposition on the context window of LLMs and investigate how this approach can enable translation of very large files that originally could not be translated due to the out-of-context issue. Our observations from 20 well-known Java projects and approximately 60K methods suggest that method-level program decomposition significantly mitigates the limited context window problem of LLMs, by 99.5%. Furthermore, our empirical analysis indicates that with method-level decomposition, each input fragment consumes on average only 5% of the context window, leaving more context space for prompt engineering and the output. Finally, we investigate the effectiveness of a Call Graph (CG) approach for translating very large files when doing method-level program decomposition.
Software Engineering
What problem does this paper attempt to address?
The paper attempts to address the issue of context window size limitations faced by large language models (LLMs) when handling large-scale code files. Specifically, due to the limited context window of LLMs, they are unable to process industrial-scale project files, leading to "out-of-context" problems during tasks such as code translation. The authors address this issue through method-level program decomposition techniques, aiming to enhance the ability of LLMs to handle very large files and optimize the effectiveness of prompt engineering.

### Main Research Questions:

1. **Can industrial-scale projects fit within the context window of LLMs?**
   - Researchers downloaded 20 well-maintained Apache projects and analyzed the .java files of each project, finding that approximately 30% of the files could not fit within a 2K context window model.
2. **Can method-level program decomposition solve the "out-of-context" problem for LLMs?**
   - By statically analyzing and decomposing each class into method fragments, the results showed that method-level program decomposition significantly improved the "out-of-context" problem, reducing the issue from 27.73% to 0.14%, with each method fragment consuming only 5% of the context window on average.
3. **Can program decomposition techniques be combined with code translation tasks to achieve large-scale file translation?**
   - Using the Call Graph (CG) method for code translation, the results showed that method-level program decomposition effectively solved the "out-of-context" problem during translation, with all source files successfully translated and only consuming 3% of the context window.

### Main Conclusions:

- **Method-level program decomposition** significantly enhanced the ability of LLMs to handle large-scale input files, solving the "out-of-context" problem.
- **Call Graph (CG) method** performed well in code translation, effectively translating large-scale files.
- **Prompt engineering** effectiveness was also improved, as method-level decomposition reduced the context space occupied by the input, leaving more space for prompts and outputs.

Through this research, the authors demonstrated how program decomposition techniques can optimize the performance of LLMs in handling large-scale code files, providing new insights for software engineering automation.
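
The reported percentages can be related to each other with a small calculation. The following is a minimal Python sketch, not the authors' tooling: the function names, the 2,048-token reading of a "2K" window, and the toy token counts are all assumptions made for illustration, and token counts are taken as given rather than produced by a real LLM tokenizer or a Java static analyzer. It computes the share of inputs that exceed a context window, the average share of the window each input consumes, and the relative reduction between two out-of-context rates.

```python
from typing import Sequence

# Assumption: a "2K" context window is taken to mean 2,048 tokens.
WINDOW_2K = 2048


def out_of_context_rate(token_counts: Sequence[int], window: int) -> float:
    """Percentage of inputs whose token count exceeds the context window."""
    if not token_counts:
        return 0.0
    over = sum(1 for n in token_counts if n > window)
    return 100.0 * over / len(token_counts)


def mean_window_usage(token_counts: Sequence[int], window: int) -> float:
    """Average percentage of the context window consumed by one input."""
    if not token_counts:
        return 0.0
    return 100.0 * sum(token_counts) / (len(token_counts) * window)


def relative_improvement(before_pct: float, after_pct: float) -> float:
    """Relative reduction, in percent, from one rate to another."""
    return 100.0 * (before_pct - after_pct) / before_pct


if __name__ == "__main__":
    # Hypothetical toy data: one oversized file versus the same code
    # split into 30 method-level fragments of 100 tokens each.
    file_level = [3000]
    method_level = [100] * 30

    print(out_of_context_rate(file_level, WINDOW_2K))    # file does not fit
    print(out_of_context_rate(method_level, WINDOW_2K))  # every fragment fits
    print(mean_window_usage(method_level, WINDOW_2K))

    # Rates reported in the summary above: 27.73% before, 0.14% after.
    print(round(relative_improvement(27.73, 0.14), 1))
```

Under this reading, the drop from 27.73% to 0.14% corresponds to a relative reduction of about 99.5%, which appears consistent with the figure quoted in the abstract. Whether the authors computed it this way is an assumption.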