ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia

DOI: https://doi.org/10.1145/3691620.3695552

2024-09-16

Abstract: In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which do not reflect the diverse challenges developers face in real-world contexts. To address this, we introduce ComplexCodeEval, a benchmark designed to assess large code models (LCMs) in various development tasks, including code generation, completion, API recommendation, and test case generation. It includes 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories, each annotated with function signatures, docstrings, and API references to simulate real development environments. Our experiments across ten LCMs reveal that context improves performance and that data leakage can lead to overestimation, highlighting the need for more accurate evaluations.

Software Engineering

What problem does this paper attempt to address?

The paper addresses the limitations of existing code evaluation benchmarks for assessing Large Code Models (LCMs). Specifically, it identifies three issues with existing benchmarks:

1. **Single Task**: Existing evaluation benchmarks usually focus on one specific task (such as code generation or code completion), whereas actual development involves diverse tasks, including code generation, code completion, API recommendation, and test case generation.
2. **Limited Sample Sources**: Samples in many benchmarks are manually constructed or drawn from only a few code repositories, which narrows domain coverage and fails to reflect the actual challenges of software development.
3. **Data Leakage**: Some benchmark samples might already be included in a model's training data, which could make evaluation results inaccurate and overestimate the model's actual performance.

To address these issues, the authors propose a new evaluation benchmark, ComplexCodeEval. This benchmark aims to:

- **Cover Multiple Downstream Tasks**: It can evaluate LCMs' performance on different tasks, including code generation, code completion, API recommendation, and test case generation.
- **Reflect Real Programming Environments**: By collecting samples from multiple high-starred GitHub repositories, it aims for diverse and representative samples.
- **Avoid Data Leakage**: By attaching timestamps to samples, it aims to ensure that evaluation samples have not been seen by the model during training.

With these improvements, ComplexCodeEval can more comprehensively evaluate the performance of LCMs in complex development environments.
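The timestamp-based leakage check described above can be illustrated with a minimal Python sketch. This is not the benchmark's actual tooling: the `Sample` record, its field names, the `filter_unseen` helper, and the cutoff date are all hypothetical, chosen only to show how annotated samples (function signature, docstring, API references) might be filtered to those created after a model's assumed training cutoff.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Sample:
    """Hypothetical sample record; field names are illustrative only."""
    function_signature: str
    docstring: str
    api_references: list[str] = field(default_factory=list)
    created_on: date = date.min  # assumed timestamp attached to the sample


def filter_unseen(samples, training_cutoff):
    """Keep samples created strictly after the assumed training cutoff."""
    return [s for s in samples if s.created_on > training_cutoff]


# Two made-up samples, one before and one after the cutoff.
samples = [
    Sample("def parse(path: str) -> dict", "Parse a config file.",
           ["json.load"], date(2022, 5, 1)),
    Sample("def fetch(url: str) -> bytes", "Download a resource.",
           ["urllib.request.urlopen"], date(2024, 2, 10)),
]

cutoff = date(2023, 9, 1)  # hypothetical training cutoff for some model
unseen = filter_unseen(samples, cutoff)
print([s.function_signature for s in unseen])
```

In this sketch, only the sample dated after the cutoff remains in the evaluation set. Because training cutoffs differ between models, a filter like this would need to be applied per model.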

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models

EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

McEval: Massively Multilingual Code Evaluation

Evaluating Large Language Models in Class-Level Code Generation

CodeJudge: Evaluating Code Generation with Large Language Models

DevEval: Evaluating Code Generation in Practical Software Projects

ExecRepoBench: Multi-level Executable Code Completion Evaluation

DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories

ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

InfiCoder-Eval: Systematically Evaluating the Question-Answering Capabilities of Code Large Language Models