A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends

Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, Jiachi Chen
2024-01-08
Abstract: General large language models (LLMs), represented by ChatGPT, have demonstrated significant potential in software engineering tasks such as code generation. This has led to the development of specialized LLMs for software engineering, known as Code LLMs. A considerable portion of Code LLMs is derived from general LLMs through fine-tuning. As a result, Code LLMs are updated frequently, and their performance can be influenced by their base LLMs. However, there is currently a lack of systematic investigation into Code LLMs and their performance. In this study, we conduct a comprehensive survey and analysis of the types of Code LLMs and their differences in performance compared to general LLMs. We aim to address three questions: (1) What LLMs are specifically designed for software engineering tasks, and what are the relationships among these Code LLMs? (2) Do Code LLMs really outperform general LLMs in software engineering tasks? (3) Which LLMs are more proficient in different software engineering tasks? To answer these questions, we first collect relevant literature and work from five major databases and open-source communities, resulting in 134 works for analysis. Next, we categorize the Code LLMs based on their publishers and examine their relationships with general LLMs and among themselves. Furthermore, we investigate the performance differences between general LLMs and Code LLMs in various software engineering tasks to demonstrate the impact of base models and Code LLMs. Finally, we compile the performance of LLMs across multiple mainstream benchmarks to identify the best-performing LLMs for each software engineering task. Our research not only assists developers of Code LLMs in choosing base models for the development of more advanced LLMs but also provides insights for practitioners to better understand key improvement directions for Code LLMs.
Software Engineering
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **What are the large language models (LLMs) specifically designed for software engineering tasks, and what are their relationships?** The paper categorizes Code LLMs by their developers' affiliations (companies, universities, and so on) and traces their development history, showing how the models relate to one another through iteration, fine-tuning, and improvement.
2. **Are Code LLMs truly superior to general LLMs in software engineering tasks?** Drawing on comparative experimental data, the paper finds that models fine-tuned specifically for software engineering tasks generally outperform their base models. When parameter counts are comparable, Code LLMs often perform better. In particular, the current state-of-the-art Code LLMs (such as CodeFuse-CodeLlama-34B) outperform general LLMs (such as GPT-4) in code generation tasks, although GPT-4 remains competitive in other tasks.
3. **Which LLMs perform better in different software engineering tasks?** The paper summarizes the performance of 130 major Code LLMs on key benchmarks and organizes the evaluation methods, benchmarks, and metrics that different studies use or propose. The results show that Code LLMs vary in performance across software engineering tasks, which helps developers choose the base model and fine-tuning method best suited to a specific task.

Through this analysis, the paper systematically reviews the current status and future trends of Code LLMs, offering researchers and practitioners insights for improving and optimizing the application of these models in software engineering.