Large Language Models for Code Summarization

Balázs Szalontai, Gergő Szalay, Tamás Márton, Anna Sike, Balázs Pintér, Tibor Gregorics
2024-05-29
Abstract: Recently, there has been increasing activity in using deep learning for software engineering, including tasks like code generation and summarization. In particular, the most recent coding Large Language Models seem to perform well on these problems. In this technical report, we aim to review how these models perform in code explanation/summarization, while also investigating their code generation capabilities (based on natural language descriptions).
Artificial Intelligence, Machine Learning, Programming Languages, Software Engineering
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This technical report aims to evaluate the performance of large language models (LLMs) in handling the task of converting between source code and natural language text. Specifically, the authors focus on the following two main aspects:

1. **Code Generation (Text-to-Code)**:
   - Assessing the ability of these models to generate source code based on natural language descriptions.
   - Using multiple benchmark datasets (such as HumanEval, APPS, MBPP, and DS-1000) to measure the models' performance.
2. **Code Explanation/Summarization (Code-to-Text)**:
   - Evaluating the ability of these models to convert source code into natural language explanations or summaries.
   - Using some benchmark datasets (such as CodeXGLUE and HumanEvalExplain) to measure the models' performance.

### Background and Motivation

In recent years, the application of deep learning in software engineering has become increasingly widespread, particularly in code generation and summarization tasks. Some recent large language models have shown excellent performance in these tasks. However, there is relatively little research on the specific performance of these models in code explanation and summarization. Therefore, this technical report aims to fill this gap by systematically evaluating and comparing the performance of various open-source large language models in these two tasks.

### Main Objectives

1. **Evaluate the Performance of Existing Models**:
   - Systematically evaluate the performance of different open-source large language models in code generation and explanation/summarization tasks using multiple benchmark datasets.
   - Compare the strengths and weaknesses of different models in these tasks.
2. **Explore the Potential and Limitations of the Models**:
   - Analyze the advantages and shortcomings of existing models in code generation and explanation/summarization tasks.
   - Propose future research directions and improvement methods.

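
To make the text-to-code setting concrete: benchmarks such as HumanEval and MBPP pair each natural language prompt with unit tests, and a generated solution counts as correct only if those tests pass. The sketch below illustrates that check in its simplest form. The function name `passes_tests` and the toy `add` problem are hypothetical, and this is not the report's evaluation harness. Real harnesses also run candidates in a sandbox with timeouts, which this sketch omits.

```python
def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Return True if the candidate code runs and all test assertions hold.

    Illustrative only: executes untrusted code in-process, with no sandbox
    and no timeout, so it must not be used on real model output as-is.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function(s)
        exec(test_source, namespace)       # run assertions against them
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


# Hypothetical toy problem: "Write a function add(a, b) that returns the sum."
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
good_candidate = "def add(a, b):\n    return a + b\n"
bad_candidate = "def add(a, b):\n    return a - b\n"

print(passes_tests(good_candidate, tests))
print(passes_tests(bad_candidate, tests))
```

The per-sample pass/fail outcomes produced this way are the raw counts that a metric like Pass@k aggregates.
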
### Methodology

- **Benchmark Datasets**:
  - Code Generation Tasks: HumanEval, APPS, MBPP, DS-1000, etc.
  - Code Explanation/Summarization Tasks: CodeXGLUE, HumanEvalExplain, etc.
- **Evaluation Metrics**:
  - Code Generation: Pass@k, BLEU, ROUGE, etc.
  - Code Explanation/Summarization: Pass@k, BLEU, ROUGE, etc.

### Conclusion

Through systematic evaluation, the authors found that:

- In code generation tasks, some models (such as CodeLlama, DeepSeekCoder, MagiCoder, etc.) performed excellently, especially on the HumanEval and MBPP benchmarks.
- In code explanation/summarization tasks, MagiCoder (DS-6.7B) performed best in Python code explanation, while WaveCoder (DS-6.7B) performed best in explaining code in multiple languages.

Although many models perform well in code generation tasks, there is still significant room for improvement in code explanation/summarization tasks. This indicates that future research needs to focus more on how to improve the performance of models in code explanation/summarization tasks.
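
The summary names Pass@k as a metric without saying how it is computed. A common choice, assumed here rather than confirmed by the report, is the unbiased estimator widely used with HumanEval: if n samples are generated for a problem and c of them pass the tests, then pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The function names below are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given n generated samples of which c passed the tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up counts for three problems, 10 samples each (not data from the report).
example_results = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(example_results, k=1))
print(mean_pass_at_k(example_results, k=5))
```

With k = 1 the estimator reduces to the fraction of correct samples per problem, which is why pass@1 is often read as plain accuracy.
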