Abstract: Legacy software systems, written in outdated languages like MUMPS and mainframe assembly, pose challenges in efficiency, maintenance, staffing, and security. While LLMs offer promise for modernizing these systems, their ability to understand legacy languages is largely unknown. This paper investigates the utilization of LLMs to generate documentation for legacy code using two datasets: an electronic health records (EHR) system in MUMPS and open-source applications in IBM mainframe Assembly Language Code (ALC). We propose a prompting strategy for generating line-wise code comments and a rubric to evaluate their completeness, readability, usefulness, and hallucination. Our study assesses the correlation between human evaluations and automated metrics, such as code complexity and reference-based metrics. We find that LLM-generated comments for MUMPS and ALC are generally hallucination-free, complete, readable, and useful compared to ground-truth comments, though ALC poses challenges. However, no automated metrics strongly correlate with comment quality to predict or measure LLM performance. Our findings highlight the limitations of current automated measures and the need for better evaluation metrics for LLM-generated documentation in legacy systems.

What problem does this paper attempt to address?

The main problem this paper addresses is **how to use large language models (LLMs) to generate high-quality documentation for legacy code**. Specifically, the paper focuses on two aspects:

1. **Challenges in Legacy Code Modernization**:
   - Legacy systems are usually written in obsolete languages, such as MUMPS and IBM mainframe assembly language (ALC), which create difficulties for maintenance, development, staffing, and security.
   - Directly translating legacy code into modern languages is risky, because LLMs may introduce errors or inaccurate information when interpreting and translating these archaic languages.
2. **Quality Assessment of Automatically Generated Documentation**:
   - Effective automated methods for measuring the quality of LLM-generated documentation are currently lacking.
   - Manually evaluating a large number of comments is costly, yet every new system needs to be evaluated. Even in mainstream languages, the quality and reliability of generated documentation can vary greatly from one code base to another.

### Specific Goals of the Paper

- **Propose an effective prompting strategy** for generating line-by-line code comments that prevents LLMs from modifying the code or producing incomplete output.
- **Develop evaluation metrics**, including human scores and automatically calculated metrics (such as code complexity, running time, and reference-based metrics), to evaluate the quality of the generated comments.
- **Study the performance of LLMs on legacy languages (such as MUMPS and ALC)** by testing them on two real-world datasets: an electronic health records system written in MUMPS and IBM mainframe assembly language applications.

### Main Findings

- The comments that LLMs generate for MUMPS and ALC code are generally free of hallucinations, complete, readable, and useful, although ALC poses challenges.
- However, none of the existing automated metrics correlates strongly with human scores, so they cannot reliably predict or measure the quality of LLM-generated documentation.
- The study highlights the limitations of current automated evaluation methods and the need for better evaluation metrics to support the use of LLMs on legacy systems.

Through this work, the authors hope to give organizations a framework for deciding whether LLMs can be used to automatically generate documentation for a specific code base, thereby accelerating software modernization.
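The goal of keeping an LLM from modifying the code or returning incomplete output can also be checked after generation. The following is a minimal Python sketch of such a post-check, not the paper's method: the function names `strip_comment` and `check_commented_output` are hypothetical, and it assumes MUMPS-style comments that start at the first `;` outside a double-quoted string literal, with one output line per source line.

```python
def strip_comment(line: str) -> str:
    """Return the code part of a line, dropping a trailing ';' comment.

    Assumption: a ';' inside a double-quoted string literal is not a comment.
    """
    in_string = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_string = not in_string  # a doubled quote ("") toggles twice
        elif ch == ";" and not in_string:
            return line[:i].rstrip()
    return line.rstrip()


def check_commented_output(original: str, commented: str) -> list[str]:
    """List the problems found when comparing LLM output with the source."""
    problems = []
    src = original.splitlines()
    out = commented.splitlines()
    if len(out) != len(src):
        problems.append(f"line count changed: {len(src)} -> {len(out)}")
    for n, (a, b) in enumerate(zip(src, out), start=1):
        code = strip_comment(b)
        if strip_comment(a) != code:
            problems.append(f"line {n}: code modified")
        elif code.strip() and code == b.rstrip():
            # The line has code but nothing follows it, so no comment was added.
            problems.append(f"line {n}: no comment")
    return problems


# Illustrative inputs only; these snippets are not taken from the paper's datasets.
original = ' SET X=1\n WRITE "A;B",!'
good = ' SET X=1 ; initialise X to 1\n WRITE "A;B",! ; print the literal and a newline'
bad = ' SET X=2 ; initialise X\n WRITE "A;B",!'

print(check_commented_output(original, good))
print(check_commented_output(original, bad))
```

A check like this only confirms that the code survived and that each line received a comment. It says nothing about whether the comment is accurate, which is the part the study found automated metrics unable to capture.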

Leveraging LLMs for Legacy Code Modernization: Challenges and Opportunities for LLM-Generated Documentation