Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation

Luis Mayer, Christian Heumann, Matthias Aßenmacher
2024-09-06
Abstract: In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. Within the scope of this research, we evaluate five different state-of-the-art LLMs - Bard, BingChat, ChatGPT, Llama2, and Code Llama - concerning their capabilities for text-to-code generation. In an empirical study, we feed prompts with textual descriptions of coding problems sourced from the programming website LeetCode to the models with the task of creating solutions in Python. Subsequently, the quality of the generated outputs is assessed using the testing functionalities of LeetCode. The results indicate large differences in performance between the investigated models. ChatGPT handles these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama. To gain further insights, we measure the runtime as well as the memory usage of the generated outputs and compare them to the other code submissions on LeetCode. A detailed error analysis, encompassing a comparison of the differences concerning correct indentation and form of the generated code as well as an assignment of the incorrectly solved tasks to certain error categories, allows us to obtain a more nuanced picture of the results and potential for improvement. The results also show a clear pattern: the models produce increasingly incorrect code when they face a lot of context in the form of longer prompts.
Computation and Language, Machine Learning, Software Engineering
What problem does this paper attempt to address?
The paper evaluates five large language models (LLMs) on text-to-code generation: Bard, BingChat, ChatGPT, Llama2, and Code Llama. The researchers fed each model prompts containing textual descriptions of programming problems from the coding website LeetCode and asked for solutions in Python, then assessed the generated code with LeetCode's testing functionality. Performance differed substantially among the models. ChatGPT solved the problems most effectively, followed by BingChat, while Llama2 and Code Llama performed poorly; despite being specialized for code generation, Code Llama did not significantly outperform its base model Llama2. The study also measured the runtime and memory usage of the generated code and compared them with other code submissions on LeetCode. A detailed error analysis, covering indentation and form of the generated code as well as error categories for the incorrectly solved tasks, gives a more nuanced picture, and the results show a trend in which the correctness of the generated code decreases as prompt length increases.
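The evaluation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the study relied on LeetCode's own testing functionality, whereas this example assumes a local stand-in that runs a candidate function against (arguments, expected output) pairs. The helper names `passes_all` and `acceptance_by_prompt_length`, the character-count threshold, and the toy inputs are all hypothetical and carry no results from the paper.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple


def passes_all(solution: Callable, tests: Sequence[Tuple[tuple, object]]) -> bool:
    """Return True only if the candidate returns the expected value on every test.

    A raised exception counts as a failure, loosely mirroring a runtime error
    verdict. This is a local stand-in, not LeetCode's judge.
    """
    for args, expected in tests:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            return False
    return True


def acceptance_by_prompt_length(
    results: Iterable[Tuple[int, bool]], threshold: int
) -> dict:
    """Group (prompt_length, accepted) pairs into 'short' and 'long' prompts.

    Returns the fraction of accepted solutions per group, or None for an
    empty group. The threshold is an arbitrary, hypothetical cut-off.
    """
    buckets = {"short": [], "long": []}
    for length, accepted in results:
        key = "short" if length <= threshold else "long"
        buckets[key].append(accepted)
    rates: dict = {}
    for key, values in buckets.items():
        rate: Optional[float] = sum(values) / len(values) if values else None
        rates[key] = rate
    return rates


if __name__ == "__main__":
    # Toy task: add two integers. Both candidates are made up for illustration.
    tests = [((1, 2), 3), ((0, 0), 0)]
    correct = lambda a, b: a + b
    wrong = lambda a, b: a - b
    print(passes_all(correct, tests), passes_all(wrong, tests))
```

Splitting at a single threshold is the simplest way to look for a length effect; a finer analysis would bin prompt lengths or fit a trend, which this sketch leaves out.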