What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This technical report aims to evaluate the performance of large language models (LLMs) in handling the task of converting between source code and natural language text. Specifically, the authors focus on the following two main aspects:

1. **Code Generation (Text-to-Code)**:
   - Assessing the ability of these models to generate source code based on natural language descriptions.
   - Using multiple benchmark datasets (such as HumanEval, APPS, MBPP, and DS-1000) to measure the models' performance.
2. **Code Explanation/Summarization (Code-to-Text)**:
   - Evaluating the ability of these models to convert source code into natural language explanations or summaries.
   - Using some benchmark datasets (such as CodeXGLUE and HumanEvalExplain) to measure the models' performance.

### Background and Motivation

In recent years, the application of deep learning in software engineering has become increasingly widespread, particularly in code generation and summarization tasks. Some recent large language models have shown excellent performance in these tasks. However, there is relatively little research on the specific performance of these models in code explanation and summarization. Therefore, this technical report aims to fill this gap by systematically evaluating and comparing the performance of various open-source large language models in these two tasks.

### Main Objectives

1. **Evaluate the Performance of Existing Models**:
   - Systematically evaluate the performance of different open-source large language models in code generation and explanation/summarization tasks using multiple benchmark datasets.
   - Compare the strengths and weaknesses of different models in these tasks.
2. **Explore the Potential and Limitations of the Models**:
   - Analyze the advantages and shortcomings of existing models in code generation and explanation/summarization tasks.
   - Propose future research directions and improvement methods.
### Methodology

- **Benchmark Datasets**:
  - Code Generation Tasks: HumanEval, APPS, MBPP, DS-1000, etc.
  - Code Explanation/Summarization Tasks: CodeXGLUE, HumanEvalExplain, etc.
- **Evaluation Metrics**:
  - Code Generation: Pass@k, BLEU, ROUGE, etc.
  - Code Explanation/Summarization: Pass@k, BLEU, ROUGE, etc.

### Conclusion

Through systematic evaluation, the authors found that:

- In code generation tasks, some models (such as CodeLlama, DeepSeekCoder, MagiCoder, etc.) performed excellently, especially on the HumanEval and MBPP benchmarks.
- In code explanation/summarization tasks, MagiCoder (DS-6.7B) performed best in Python code explanation, while WaveCoder (DS-6.7B) performed best in explaining code in multiple languages.

Although many models perform well in code generation tasks, there is still significant room for improvement in code explanation/summarization tasks. This indicates that future research needs to focus more on how to improve the performance of models in code explanation/summarization tasks.
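
For readers unfamiliar with Pass@k, one of the evaluation metrics named in the summary, the sketch below shows the widely used unbiased estimator that was introduced together with the HumanEval benchmark: generate `n` samples per problem, count the `c` samples that pass the unit tests, and estimate the probability that at least one of `k` randomly drawn samples passes as `1 - C(n-c, k) / C(n, k)`. The summary does not say which Pass@k variant the report uses, so this is an illustrative assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for a single problem.

    n: total number of generated samples
    c: number of samples that passed all tests
    k: number of samples drawn (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that all k drawn samples are failures is
    # C(n - c, k) / C(n, k); math.comb returns 0 when n - c < k,
    # which correctly yields a pass@k of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

A benchmark-level score is then typically the mean of this per-problem value over all problems in the dataset.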

Large Language Models for Code Summarization

Analyzing the Performance of Large Language Models on Code Summarization

Source Code Summarization in the Era of Large Language Models

Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization)

Can Large Language Models Serve as Evaluators for Code Summarization?

Context-aware Code Summary Generation

Large Language Models Meet NL2Code: A Survey

CodeSum: Translate Program Language to Natural Language

Can Large Language Models Write Parallel Code?

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Interpretation-based Code Summarization.

Large Language Models for Scientific Synthesis, Inference and Explanation

Demystifying Code Summarization Models.

Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

ClassSum: a Deep Learning Model for Class-Level Code Summarization

Scientific Computing with Large Language Models

A Survey on Large Language Models for Code Generation

A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends

Effective Approaches to Combining Lexical and Syntactical Information for Code Summarization

Scaling Up Video Summarization Pretraining with Large Language Models