Abstract: The rising popularity of Large Language Models (LLMs) has motivated exploring their use in code-related tasks. Code LLMs with more than millions of parameters are trained on a massive amount of code in different Programming Languages (PLs). Such models are used for automating various Software Engineering (SE) tasks using prompt engineering. However, given the very large size of industry-scale project files, a major issue of these LLMs is their limited context window size, motivating the question: "Can these LLMs process very large files, and can we effectively perform prompt engineering?" Code translation aims to convert source code from one PL to another. In this work, we assess the effect of method-level program decomposition on the context window of LLMs and investigate how this approach can enable translation of very large files that originally could not be translated due to the out-of-context issue. Our observations from 20 well-known Java projects and approximately 60K methods suggest that method-level program decomposition significantly improves the limited context window problem of LLMs by 99.5%. Furthermore, our empirical analysis indicates that with method-level decomposition, each input fragment on average consumes only 5% of the context window, leaving more context space for prompt engineering and the output. Finally, we investigate the effectiveness of a Call Graph (CG) approach for translating very large files when doing method-level program decomposition.
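To make the idea of method-level decomposition concrete, the following is a minimal sketch and not the authors' tooling. It assumes Python source and the standard `ast` module in place of the Java projects and static analysis used in the study, a whitespace split as a crude stand-in for a real LLM tokenizer, and a 2048-token window. The names `method_fragments`, `count_tokens`, `window_usage`, and the `Account` sample class are all hypothetical.

```python
import ast

# Assumption: a "2K" context window means 2048 tokens.
CONTEXT_WINDOW = 2048


def count_tokens(text: str) -> int:
    """Crude stand-in for an LLM tokenizer: count whitespace-separated pieces."""
    return len(text.split())


def method_fragments(source: str) -> dict[str, str]:
    """Map 'Class.method' to the source text of each method in top-level classes."""
    fragments: dict[str, str] = {}
    tree = ast.parse(source)
    for cls in tree.body:
        if not isinstance(cls, ast.ClassDef):
            continue
        for item in cls.body:
            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # get_source_segment returns the exact text spanned by the node.
                fragments[f"{cls.name}.{item.name}"] = ast.get_source_segment(source, item)
    return fragments


def window_usage(fragment: str, window: int = CONTEXT_WINDOW) -> float:
    """Fraction of the context window that one fragment would occupy."""
    return count_tokens(fragment) / window


# Hypothetical input class, used only for illustration.
SAMPLE = '''
class Account:
    def deposit(self, amount):
        self.balance += amount

    def withdraw(self, amount):
        self.balance -= amount
'''

if __name__ == "__main__":
    for name, fragment in method_fragments(SAMPLE).items():
        print(f"{name}: {window_usage(fragment):.2%} of the window")
```

Each fragment is measured on its own, so a file that is too large as a whole can still yield fragments that fit, as long as no single method exceeds the window.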

What problem does this paper attempt to address?

The paper attempts to address the issue of context window size limitations faced by large language models (LLMs) when handling large-scale code files. Specifically, due to the limited context window of LLMs, they are unable to process industrial-scale project files, leading to "out-of-context" problems during tasks such as code translation. The authors address this issue through method-level program decomposition techniques, aiming to enhance the ability of LLMs to handle very large files and optimize the effectiveness of prompt engineering.

### Main Research Questions:

1. **Can industrial-scale projects fit within the context window of LLMs?**
   - Researchers downloaded 20 well-maintained Apache projects and analyzed the .java files of each project, finding that approximately 30% of the files could not fit within a 2K context window model.
2. **Can method-level program decomposition solve the "out-of-context" problem for LLMs?**
   - By statically analyzing and decomposing each class into method fragments, the results showed that method-level program decomposition significantly improved the "out-of-context" problem, reducing the issue from 27.73% to 0.14%, with each method fragment consuming only 5% of the context window on average.
3. **Can program decomposition techniques be combined with code translation tasks to achieve large-scale file translation?**
   - Using the Call Graph (CG) method for code translation, the results showed that method-level program decomposition effectively solved the "out-of-context" problem during translation, with all source files successfully translated and only consuming 3% of the context window.

### Main Conclusions:

- **Method-level program decomposition** significantly enhanced the ability of LLMs to handle large-scale input files, solving the "out-of-context" problem.
- **Call Graph (CG) method** performed well in code translation, effectively translating large-scale files.
- **Prompt engineering** effectiveness was also improved, as method-level decomposition reduced the context space occupied by the input, leaving more space for prompts and outputs.

Through this research, the authors demonstrated how program decomposition techniques can optimize the performance of LLMs in handling large-scale code files, providing new insights for software engineering automation.
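The summary does not say how the call graph is traversed during translation. As one plausible reading, the sketch below assumes a callee-first order, so that each method is translated only after the methods it calls. It uses Python's standard `graphlib` module; the function `translation_order` and the method names in `CALL_GRAPH` are hypothetical and not taken from the paper.

```python
from graphlib import CycleError, TopologicalSorter


def translation_order(call_graph: dict[str, set[str]]) -> list[str]:
    """Order methods so that every callee comes before its callers.

    call_graph maps each caller to the set of methods it calls.
    TopologicalSorter treats the mapped values as predecessors, so passing
    caller -> callees yields callees first.
    """
    return list(TopologicalSorter(call_graph).static_order())


# Hypothetical call graph, used only for illustration.
CALL_GRAPH = {
    "Report.render": {"Report.header", "Report.body"},
    "Report.body": {"Format.money"},
    "Report.header": set(),
    "Format.money": set(),
}

if __name__ == "__main__":
    try:
        for method in translation_order(CALL_GRAPH):
            print("translate:", method)
    except CycleError as err:
        # Recursive or mutually recursive methods form a cycle and need
        # separate handling, which this sketch does not attempt.
        print("cycle in call graph:", err.args[1])
```

With a callee-first order, the prompt for a caller could include the already translated signatures of its callees instead of their full source, which fits the small per-fragment context usage reported above. That prompt design is an assumption, not something the summary states.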

Program Decomposition and Translation with Static Analysis
