A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends

Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, Jiachi Chen

2024-01-08

Abstract: General large language models (LLMs), represented by ChatGPT, have demonstrated significant potential in software engineering tasks such as code generation. This has led to the development of LLMs specialized for software engineering, known as Code LLMs. A considerable portion of Code LLMs are derived from general LLMs through fine-tuning. As a result, Code LLMs are updated frequently, and their performance can be influenced by their base LLMs. However, there is currently a lack of systematic investigation into Code LLMs and their performance. In this study, we conduct a comprehensive survey and analysis of the types of Code LLMs and how their performance differs from that of general LLMs. We aim to address three questions: (1) What LLMs are specifically designed for software engineering tasks, and what are the relationships among these Code LLMs? (2) Do Code LLMs really outperform general LLMs in software engineering tasks? (3) Which LLMs are more proficient in different software engineering tasks? To answer these questions, we first collect relevant literature and work from five major databases and open-source communities, resulting in 134 works for analysis. Next, we categorize the Code LLMs based on their publishers and examine their relationships with general LLMs and with one another. Furthermore, we investigate the performance differences between general LLMs and Code LLMs across various software engineering tasks to demonstrate the impact of base models and Code LLMs. Finally, we compile the performance of LLMs across multiple mainstream benchmarks to identify the best-performing LLMs for each software engineering task. Our research not only assists developers of Code LLMs in choosing base models for the development of more advanced LLMs but also provides insights for practitioners to better understand key improvement directions for Code LLMs.

Software Engineering

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **What are the large language models (LLMs) specifically designed for software engineering tasks, and what are their relationships?** The paper categorizes Code LLMs by their developers' affiliations (such as companies and universities) and traces their development history, including how the models relate to one another through iteration, fine-tuning, and improvement.
2. **Are Code LLMs truly superior to general LLMs in software engineering tasks?** Drawing on comparative experimental data, the paper finds that models specifically fine-tuned for software engineering tasks generally outperform their base models. When parameter counts are comparable, Code LLMs often perform better. In particular, the current state-of-the-art Code LLMs (such as CodeFuse-CodeLlama-34B) outperform general LLMs (such as GPT-4) in code generation tasks, although GPT-4 remains competitive in other tasks.
3. **Which LLMs perform better in different software engineering tasks?** The paper summarizes the performance of 130 major Code LLMs on key benchmarks and organizes the evaluation methods, benchmarks, and metrics used or newly proposed by different studies. The results show that Code LLMs vary in performance across software engineering tasks, which helps developers choose the base model and fine-tuning method best suited to a specific task.

Through this research, the paper systematically reviews the current status and future trends of Code LLMs, providing insights that researchers and practitioners can use to improve and optimize the application of these models in software engineering.

A Survey on Large Language Models for Code Generation

A Survey on Evaluating Large Language Models in Code Generation Tasks

Towards an Understanding of Large Language Models in Software Engineering Tasks

Evaluating Large Language Models in Class-Level Code Generation

A Survey on Large Language Models for Software Engineering

If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents

Large Language Models in Computer Science Education: A Systematic Literature Review

Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code

An Empirical Study on Low Code Programming using Traditional vs Large Language Model Support

Large Language Models for Software Engineering: A Systematic Literature Review

Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

Software Service Engineering in the Era of Large Language Models

On the Effectiveness of Large Language Models in Domain-Specific Code Generation

Where Are Large Language Models for Code Generation on GitHub?

Examination of Code generated by Large Language Models

From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future

Large Language Models as Code Executors: An Exploratory Study