ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia
DOI: https://doi.org/10.1145/3691620.3695552
2024-09-16
Abstract: In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which do not reflect the diverse challenges developers face in real-world contexts. To address this, we introduce ComplexCodeEval, a benchmark designed to assess large code models (LCMs) in various development tasks, including code generation, completion, API recommendation, and test case generation. It includes 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories, each annotated with function signatures, docstrings, and API references to simulate real development environments. Our experiments across ten LCMs reveal that context improves performance and that data leakage can lead to overestimation, highlighting the need for more accurate evaluations.
Software Engineering
What problem does this paper attempt to address?
The paper addresses the limitations of existing benchmarks for evaluating Large Code Models (LCMs). Specifically, existing benchmarks have the following issues:
1. **Single Task**: Existing benchmarks usually focus on one task (such as code generation or code completion), whereas real development involves diverse tasks, including code generation, code completion, API recommendation, and test case generation.
2. **Limited Sample Sources**: Many benchmarks' samples are manually constructed or drawn from only a few code repositories, so they cover a narrow range of application domains and fail to reflect the actual challenges of software development.
3. **Data Leakage**: Some benchmarks' samples may already be included in a model's training data, which can make evaluation results inaccurate and overestimate the model's actual performance.

To address these issues, the authors propose a new evaluation benchmark, ComplexCodeEval. This benchmark aims to:
- **Cover Multiple Downstream Tasks**: It evaluates LCMs on different tasks, including code generation, code completion, API recommendation, and test case generation.
- **Reflect Real Programming Environments**: By collecting samples from multiple high-starred GitHub repositories, it aims for diverse and representative samples.
- **Avoid Data Leakage**: By attaching timestamps to samples, it helps ensure that evaluation samples were not seen by the model during training.

With these improvements, ComplexCodeEval can evaluate the performance of LCMs in complex development environments more comprehensively.
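The timestamp-based leakage check described above can be sketched as follows. This is a minimal illustration only, assuming each sample carries a single creation date and each model has a known training cutoff date. The `Sample` class, its field names, the `split_by_cutoff` helper, and the example dates are all hypothetical and are not taken from the paper or its released data.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Sample:
    """One benchmark sample. Field names are illustrative, not the paper's schema."""
    function_signature: str
    docstring: str
    api_references: list[str] = field(default_factory=list)
    created_at: date = date.min  # hypothetical timestamp attached to the sample


def split_by_cutoff(samples, training_cutoff):
    """Partition samples into (possibly_seen, unseen) relative to a training cutoff.

    A sample dated on or before the cutoff may have been in the training data;
    a sample dated after it cannot have been.
    """
    possibly_seen = [s for s in samples if s.created_at <= training_cutoff]
    unseen = [s for s in samples if s.created_at > training_cutoff]
    return possibly_seen, unseen


# Made-up samples and cutoff, purely for demonstration.
samples = [
    Sample("def parse(path: str) -> dict", "Parse a config file.",
           ["json.load"], date(2022, 5, 1)),
    Sample("def fetch(url: str) -> bytes", "Download a resource.",
           ["urllib.request.urlopen"], date(2024, 2, 10)),
]
possibly_seen, unseen = split_by_cutoff(samples, training_cutoff=date(2023, 6, 30))
print(len(possibly_seen), len(unseen))
```

Scoring a model separately on the two partitions is one way to see whether results on older samples are inflated relative to samples the model could not have seen.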