CodeJudge: Evaluating Code Generation with Large Language Models

Weixi Tong, Tianyi Zhang
2024-10-03
Abstract: Large Language Models (LLMs) have shown promising performance in code generation. However, how to reliably evaluate code generated by LLMs remains an unresolved problem. This paper presents CodeJudge, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases. We investigate different ways to guide the LLM in performing "slow thinking" to arrive at an in-depth and reliable evaluation. We experimented with four LLMs as evaluators on four code generation datasets and five programming languages. The results show that CodeJudge significantly outperformed existing methods in most settings. Furthermore, compared with a SOTA GPT-3.5-based code evaluation method, CodeJudge achieved better results even when using a much smaller model, Llama-3-8B-Instruct. Our code and datasets are available on GitHub at https://github.com/VichyTong/CodeJudge.
Machine Learning, Computation and Language, Software Engineering
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses the problem of reliably evaluating code generated by large language models (LLMs). Existing evaluation methods face the following challenges:

1. **Dependence on test cases**: Many existing methods rely on manually written test cases to judge code quality. However, many tasks come with no test cases, or with test cases that are insufficient and do not cover all boundary conditions.
2. **Syntactic variation**: Generated code may differ syntactically from the reference code while being semantically equivalent, for example by using a `while` loop instead of a `for` loop or by following a different variable naming convention.
3. **Multiple valid solutions**: Some code generation tasks admit several different solutions. Sorting integers, for example, can be implemented with many different sorting algorithms.
4. **Partial correctness**: Code generated by LLMs is often only partially correct. Such code is not fully correct, yet it can still serve as a starting point or a source of ideas for developers.

To address these problems, the paper proposes a code evaluation framework named **CodeJudge**. The framework uses LLMs to evaluate the semantic correctness of generated code without relying on test cases. CodeJudge aims for a more in-depth and reliable evaluation by guiding the LLM to perform "slow thinking", that is, to analyze the code's functionality step by step. CodeJudge supports two evaluation types:

1. **Binary evaluation**: Determine whether the generated code is correct.
2. **Deviation evaluation**: Estimate how far the generated code deviates from the code the user intended.

### Main contributions

1. **Evaluation framework**: Proposes CodeJudge, an LLM-based code evaluation framework that can evaluate the semantic correctness of code without test cases.
2. **Evaluation methods**: Designs two methods that guide LLMs to perform "slow thinking" in order to improve the reliability and accuracy of the evaluation.
3. **Experimental verification**: Conducts extensive experiments on five programming languages (Java, C++, Python, JavaScript, Go) and four datasets (HumanEval-X, CoNaLa, APPS, BigCodeBench) to verify the effectiveness of CodeJudge.
4. **Performance comparison**: Compares CodeJudge with nine existing methods. The results show that CodeJudge significantly outperforms them in most settings and can achieve better results even with a smaller model.

### Experimental results

- **Statistical correlation**: On the HumanEval-X and CoNaLa datasets, CodeJudge reaches Spearman correlation coefficients of 0.612 and 0.562 respectively, indicating that its assessments correlate well with the ground-truth labels.
- **Accuracy**: In the binary evaluation task, CodeJudge reaches an average accuracy of 80.56%, which is significantly higher than that of the other methods.
- **No reference code**: Even without reference code, CodeJudge still achieves reasonable performance. For example, on the HumanEval-X dataset its Kendall's τ coefficient is 0.502 and its accuracy is 73.13%.

### Conclusion

The paper addresses the problem of reliably evaluating LLM-generated code by proposing the CodeJudge framework, which is especially useful when no test cases are available. By guiding LLMs to perform "slow thinking", CodeJudge improves the accuracy and reliability of the evaluation and offers a new approach to evaluating code generation tasks.
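
The binary evaluation described above can be pictured as two model calls: one that analyzes the code step by step, and one that turns that analysis into a verdict. The sketch below is an illustration under stated assumptions, not the paper's implementation. The prompt templates, the function `judge_binary`, and the stub `fake_llm` are all hypothetical; the summary does not reproduce the actual prompts, and a real evaluator would replace the stub with a call to an LLM.

```python
from typing import Callable

# Hypothetical prompt templates, written for this sketch only.
# They are NOT the prompts used by CodeJudge.
ANALYZE_PROMPT = (
    "Task description:\n{task}\n\n"
    "Candidate code:\n{code}\n\n"
    "Explain step by step what the code does and whether it meets "
    "each requirement of the task. Do not give a final verdict yet."
)
SUMMARIZE_PROMPT = (
    "Analysis of a candidate solution:\n{analysis}\n\n"
    "Based only on this analysis, answer with one word: Yes if the "
    "code is correct, No otherwise."
)


def judge_binary(task: str, code: str, llm: Callable[[str], str]) -> bool:
    """Return True if the model judges the code correct.

    Step 1 asks for a step-by-step analysis ("slow thinking").
    Step 2 asks for a one-word verdict based on that analysis.
    """
    analysis = llm(ANALYZE_PROMPT.format(task=task, code=code))
    verdict = llm(SUMMARIZE_PROMPT.format(analysis=analysis))
    return verdict.strip().lower().startswith("yes")


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model so the sketch runs offline."""
    if prompt.startswith("Task description"):
        return "The function returns a + b, which matches the task."
    return "Yes"


if __name__ == "__main__":
    task = "Write a function that adds two numbers."
    code = "def add(a, b):\n    return a + b"
    print(judge_binary(task, code, fake_llm))
```

Separating analysis from verdict keeps the final answer short and easy to parse, while still giving the model room to reason before it commits to a decision.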
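
The correlation figures reported under the experimental results measure how well an evaluator's scores agree in rank order with ground-truth labels. As a minimal sketch of what the two coefficients compute, the following pure-Python code implements Spearman's rank correlation and Kendall's τ (the tie-aware τ-b variant). Which tie-handling variant the paper uses is not stated in this summary, so τ-b is an assumption here, and the function names are chosen for this example only.

```python
from itertools import combinations


def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: pairwise agreement, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: counted nowhere
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = ((untied + ties_x) * (untied + ties_y)) ** 0.5
    return (concordant - discordant) / denominator


if __name__ == "__main__":
    # Made-up evaluator scores and binary correctness labels.
    scores = [0.9, 0.2, 0.7, 0.4]
    labels = [1, 0, 1, 0]
    print(spearman(scores, labels), kendall_tau_b(scores, labels))
```

Both coefficients range from -1 to 1 and depend only on the ordering of the scores, which is why they suit comparisons between an evaluator's graded output and coarse correctness labels.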