Abstract: To support software developers in understanding and maintaining programs, various automatic (source) code summarization techniques have been proposed to generate a concise natural language summary (i.e., comment) for a given code snippet. Recently, the emergence of large language models (LLMs) has led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs, which covers multiple aspects involved in the workflow of LLM-based code summarization. Specifically, we begin by examining prevalent automated evaluation methods for assessing the quality of summaries generated by LLMs and find that the results of the GPT-4 evaluation method are most closely aligned with human evaluation. Then, we explore the effectiveness of five prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert) in adapting LLMs to code summarization tasks. Contrary to expectations, advanced prompting techniques may not outperform simple zero-shot prompting. Next, we investigate the impact of LLMs' model settings (including top\_p and temperature parameters) on the quality of generated summaries. We find the impact of the two parameters on summary quality varies by the base LLM and programming language, but their impacts are similar. Moreover, we canvass LLMs' abilities to summarize code snippets in distinct types of programming languages. The results reveal that LLMs perform suboptimally when summarizing code written in logic programming languages compared to other language types. Finally, we unexpectedly find that CodeLlama-Instruct with 7B parameters can outperform advanced GPT-4 in generating summaries describing code implementation details and asserting code properties. We hope that our findings can provide a comprehensive understanding of code summarization in the era of LLMs.
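
The five prompting techniques named in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the template wording, the function name `build_turns`, the fixed expert persona, and the three-turn critique flow are hypothetical and are not the prompts used in the study.

```python
from typing import List, Sequence, Tuple

TASK = "Write a one-sentence summary of the following code."  # hypothetical wording


def build_turns(technique: str, code: str,
                examples: Sequence[Tuple[str, str]] = ()) -> List[str]:
    """Return the user turns to send, in order, for one prompting technique."""
    base = f"{TASK}\n\n{code}"
    if technique == "zero-shot":
        # Task instruction plus the snippet, nothing else.
        return [base]
    if technique == "few-shot":
        # Prepend (code, summary) demonstrations before the target snippet.
        shots = "\n\n".join(f"Code:\n{c}\nSummary: {s}" for c, s in examples)
        return [f"{shots}\n\nCode:\n{code}\nSummary:"]
    if technique == "chain-of-thought":
        # Ask for intermediate reasoning before the final summary.
        return [f"{code}\n\nFirst describe the inputs, outputs, and main steps, "
                "then give a one-sentence summary."]
    if technique == "critique":
        # Draft, self-review, then revise across three turns.
        return [base,
                "Review your summary and list anything inaccurate or missing.",
                "Rewrite the summary to address that review."]
    if technique == "expert":
        # Prefix a persona description (fixed here; it could also be generated).
        persona = "You are a senior engineer who writes precise documentation."
        return [f"{persona}\n\n{base}"]
    raise ValueError(f"unknown technique: {technique}")
```

For example, `build_turns("few-shot", snippet, [(example_code, example_summary)])` returns a single prompt containing one demonstration followed by the target snippet, while `build_turns("critique", snippet)` returns three turns to be sent in sequence.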

What problem does this paper attempt to address?

This paper addresses the following questions:

1. **Applicability of evaluation methods**: Are existing automatic evaluation methods suitable for assessing the quality of code summaries generated by large language models (LLMs)? Traditional methods based on text similarity and semantic similarity may not apply, because LLM-generated summaries often differ substantially in wording from the reference summaries. A more appropriate automated evaluation method is therefore needed.
2. **Effectiveness of different prompting techniques**: How effective are different prompting techniques (zero-shot, few-shot, chain-of-thought, critique, and expert prompting) at adapting LLMs to the code summarization task? The study found that advanced prompting techniques are not necessarily better than simple zero-shot prompting, and that the effect depends on the base LLM and the programming language.
3. **Impact of model settings on performance**: How do two key LLM parameters (top_p and temperature) affect the quality of code summaries? The results show that the impact of both parameters varies with the base LLM and the programming language.
4. **Summarization capabilities across programming language types**: How do LLMs' code summarization capabilities differ across types of programming languages (procedural, object-oriented, scripting, functional, and logic programming languages)? The results show that LLMs perform worst on logic programming languages.
5. **Generation capabilities for different types of summaries**: How do LLMs perform when generating different types of summaries (such as functional descriptions, usage methods, implementation details, etc.)? The study found that different LLMs have their own strengths and weaknesses across summary types.
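
To make the two model settings concrete, the following is a minimal sketch of how temperature scaling and top_p (nucleus) filtering are commonly combined when sampling the next token. It is a generic illustration under stated assumptions, not the decoding code of any model evaluated in the paper; the function names `nucleus` and `sample_token` and the toy logits are hypothetical.

```python
import math
import random
from typing import Dict, Optional


def nucleus(logits: Dict[str, float], temperature: float = 1.0,
            top_p: float = 1.0) -> Dict[str, float]:
    """Return the renormalized distribution left after temperature and top_p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = {tok: v / temperature for tok, v in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)
    # top_p: keep the smallest set of most likely tokens whose mass reaches top_p.
    kept, cumulative = {}, 0.0
    for prob, tok in ranked:
        kept[tok] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    return {tok: p / cumulative for tok, p in kept.items()}


def sample_token(logits: Dict[str, float], temperature: float = 1.0,
                 top_p: float = 1.0, rng: Optional[random.Random] = None) -> str:
    """Draw one token from the filtered distribution."""
    rng = rng or random.Random()
    dist = nucleus(logits, temperature, top_p)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
```

Both settings narrow the set of likely next tokens: a low temperature concentrates probability on the top token, and a low top_p discards the tail outright.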
### Main contributions

- **For the first time, explored the possibility of using LLMs as an evaluation tool** to assess the quality of code summaries generated by LLMs.
- **Systematically studied the code summarization problem in the era of LLMs**, covering in-depth analysis of multiple aspects and revealing some novel and unexpected findings.
- **Made the dataset and source code public**, so that other researchers can replicate the experiments and apply them in a wide range of situations.

Through these studies, the authors hope to give future researchers a comprehensive understanding of LLMs in the code summarization task and to help them design more advanced LLM-based code summarization techniques.

Source Code Summarization in the Era of Large Language Models

Can Large Language Models Serve as Evaluators for Code Summarization?

Context-aware Code Summary Generation

Why My Code Summarization Model Does Not Work: Code Comment Improvement with Category Prediction

Automatic Code Summarization via ChatGPT: How Far Are We?

Project-Specific Code Summarization with In-Context Learning

Interpretation-based Code Summarization.

Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization)

Analyzing the Performance of Large Language Models on Code Summarization

A Prompt Learning Framework for Source Code Summarization

Demystifying Code Summarization Models.

Neural Code Summarization: How Far Are We?

On the Evaluation of Neural Code Summarization

Large Language Models for Code Summarization

CodeSum: Translate Program Language to Natural Language

Automatic Code Summarization Using Abbreviation Expansion and Subword Segmentation

Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models

Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

Do Code Summarization Models Process Too Much Information? Function Signature May Be All What Is Needed

Do Machines and Humans Focus on Similar Code? Exploring Explainability of Large Language Models in Code Summarization