Source Code Summarization in the Era of Large Language Models

Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, Zhenyu Chen
2024-07-09
Abstract: To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including top_p and temperature parameters) on the quality of generated summaries. We find the impact of the two parameters on summary quality varies by the base LLM and programming language, but their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types. Finally, we unexpectedly find that CodeLlama-Instruct with 7B parameters can outperform advanced GPT-4 in generating summaries describing code implementation details and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
This paper investigates five questions about LLM-based code summarization:

1. **Applicability of evaluation methods**: Are existing automatic evaluation methods suitable for judging the quality of code summaries generated by large language models (LLMs)? Traditional metrics based on text similarity and semantic similarity may not fit, because LLM-generated summaries are often worded very differently from the reference summaries. A more appropriate automated evaluation method is therefore needed.
2. **Effectiveness of different prompting techniques**: How effective are different prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert prompting) at adapting LLMs to the code summarization task? The study finds that advanced prompting techniques are not necessarily better than simple zero-shot prompting, and that the effect depends on the base LLM and the programming language.
3. **Impact of model settings on performance**: How do two key LLM parameters, top_p and temperature, affect the quality of code summaries? The study shows that the impact of both parameters varies with the base LLM and the programming language.
4. **Summarization capabilities across programming language types**: How do LLMs' code summarization capabilities differ across types of programming languages (procedural, object-oriented, scripting, functional, and logic programming languages)? The results show that LLMs perform worst on logic programming languages.
5. **Generation capabilities for different types of summaries**: How do LLMs perform when generating different types of summaries (such as functional descriptions, usage methods, and implementation details)? The study finds that each LLM has its own strengths and weaknesses across summary types.
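The two model settings in the third question above act on the model's next-token probability distribution. As a rough illustration, and not the paper's experimental code, the sketch below applies temperature scaling and nucleus (top_p) filtering to a toy list of logits. The function names and the logit values are assumptions invented for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p.

    Returns a dict mapping token index to its renormalized probability.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


# Hypothetical next-token scores for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

for temperature in (0.1, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    candidates = top_p_filter(probs, top_p=0.9)
    print(temperature, sorted(candidates))
```

With these toy logits, a temperature of 0.1 concentrates almost all probability on one token, so only that token survives the top_p = 0.9 cut, while a temperature of 1.0 leaves three candidates. Both settings control how many tokens remain plausible choices, which is consistent with the paper's observation that their impacts are similar.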
### Main contributions

- **For the first time, explored the possibility of using LLMs as an evaluation tool** to evaluate the quality of code summaries generated by LLMs.
- **Systematically studied the code summarization problem in the era of LLMs**, covering in-depth analysis in multiple aspects and revealing some novel and unexpected findings.
- **Made the dataset and source code public**, so that other researchers can replicate the experiments and apply them in a wide range of situations.

Through these studies, the authors hope to provide future researchers with a comprehensive understanding of LLMs in the code summarization task and help design more advanced LLM-based code summarization techniques.
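To make the idea of using an LLM as an evaluation tool more concrete, here is a minimal sketch of the two pieces such an evaluator needs around the model call: a prompt that shows the code and a candidate summary, and a parser that extracts a numeric score from the reply. The prompt wording, the 1-to-5 scale, and the names `build_rating_prompt` and `parse_rating` are assumptions made for illustration. They are not taken from the paper, and the call to the model itself is deliberately left out.

```python
import re

# Hypothetical prompt template; the paper's actual evaluation prompt may differ.
RATING_PROMPT = (
    "You are reviewing a code summary.\n"
    "Code:\n{code}\n\n"
    "Summary:\n{summary}\n\n"
    "Rate how well the summary describes the code on a scale of 1 to 5. "
    "Answer with 'Rating: <number>'."
)


def build_rating_prompt(code, summary):
    """Fill the template with a code snippet and a candidate summary."""
    return RATING_PROMPT.format(code=code, summary=summary)


def parse_rating(reply):
    """Return the 1-5 score found in the model's reply, or None if absent."""
    match = re.search(r"Rating:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


prompt = build_rating_prompt(
    code="def add(a, b):\n    return a + b",
    summary="Returns the sum of two numbers.",
)
print(prompt)
print(parse_rating("Rating: 4. The summary is accurate but terse."))
```

Returning `None` for replies without a valid score keeps malformed model output from being silently counted as a rating. A caller could then retry or discard that sample.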