Abstract: Large Language Models have shown prominent capabilities in generating functional code from natural language descriptions. However, a standardized way to evaluate these capabilities in an objective and unbiased manner is still to be found. In this paper we review the current evaluation methods available to this end, and run a new evaluation of the performance of one state-of-the-art model (GPT4-o-mini) in solving curated coding challenges in 8 programming languages, obtained from Codewars, a software development community. Our analysis shows that the chance of success of the model has a positive correlation with the task difficulty, the popularity of the programming language being used and the time elapsed since the publication of the challenge. A further approximate explanatory analysis in terms of high-level features hints that while 46.6% of the model performance could be attributed to task difficulty, 37.4% seems to be related to leakage of the challenge solutions into the model training set, and the remaining 16% depends on the programming language. These results suggest that current evaluation methodologies might be overestimating the actual skill of Large Language Models for generating functional code.


### What problems does this paper attempt to solve?

This paper evaluates the performance of large language models (LLMs) on code generation tasks, using graded programming exercises to measure their capabilities more objectively and fairly. Specifically, it addresses the following key issues:

1. **Lack of standardized evaluation methods**:
   - There is currently no standardized, objective, and unbiased way to evaluate how well LLMs generate functional code from natural language descriptions.
2. **Limitations of existing evaluation methods**:
   - Many existing evaluation methods may suffer from data leakage: the test data overlaps with the training data, so the model's performance is overestimated.
   - Existing evaluation methods may not fully account for factors such as task difficulty, the popularity of the programming language, and the time elapsed since the challenge was published.
3. **Factors affecting model performance**:
   - The paper exposes possible biases in current evaluation methods by analyzing how task difficulty, programming language popularity, and challenge publication time relate to the model's chance of success.
   - An approximate explanatory analysis suggests that 46.6% of the model's performance could be attributed to task difficulty, 37.4% seems related to leakage of challenge solutions into the training set, and the remaining 16% depends on the programming language.

### Main contributions of the paper

- **Run a new evaluation**: Using curated coding challenges from the Codewars community in 8 programming languages, the authors evaluate one state-of-the-art model (GPT4-o-mini) on code generation tasks.
- **Reveal potential biases**: The results suggest that current evaluation methods may overestimate the actual code generation capabilities of LLMs, particularly because of data leakage.
- **Provide improvement directions**: Based on these results, the paper offers insights into how the code generation capabilities of LLMs could be evaluated more accurately in the future.

### Conclusion

Through a review of existing evaluation methods and a new experiment, this paper identifies problems in how the code generation capabilities of LLMs are currently evaluated and points to directions for improvement, with the aim of making future evaluations fairer and more reliable.
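As a rough illustration of the kind of analysis summarized above, the following Python sketch groups evaluation outcomes by a feature and computes a Pearson correlation between challenge age and success. It is not the authors' code: the record fields (`language`, `difficulty_level`, `age_days`, `passed`), the sample values, and the helper names are all hypothetical, chosen only to show the shape of the computation.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical evaluation records, one per (challenge, language) attempt.
# Field names and values are illustrative and NOT taken from the paper.
RESULTS = [
    {"language": "python", "difficulty_level": 1, "age_days": 2000, "passed": True},
    {"language": "python", "difficulty_level": 4, "age_days": 300, "passed": False},
    {"language": "python", "difficulty_level": 2, "age_days": 1200, "passed": True},
    {"language": "rust", "difficulty_level": 1, "age_days": 1500, "passed": True},
    {"language": "rust", "difficulty_level": 3, "age_days": 200, "passed": False},
    {"language": "rust", "difficulty_level": 2, "age_days": 400, "passed": False},
]


def pass_rate_by(records, key):
    """Return the fraction of passed attempts for each value of `key`."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        passes[record[key]] += int(record["passed"])
    return {value: passes[value] / totals[value] for value in totals}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    print(pass_rate_by(RESULTS, "language"))
    ages = [r["age_days"] for r in RESULTS]
    outcomes = [int(r["passed"]) for r in RESULTS]
    print(pearson(ages, outcomes))
```

On real data, a positive correlation between challenge age and success would be consistent with, though not proof of, solutions having leaked into the training set; separating that effect from difficulty and language requires a multivariate analysis such as the one the paper reports.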

An evaluation of LLM code generation capabilities through graded exercises
