Long Code Arena: a Set of Benchmarks for Long-Context Code Models

Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, Timofey Bryksin

2024-06-17

Abstract: Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models are becoming better at processing long context windows: supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, while the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and a link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.

Machine Learning, Artificial Intelligence, Information Retrieval, Software Engineering

What problem does this paper attempt to address?

This paper introduces a benchmark suite called "Long Code Arena" that addresses the problem of evaluating long-context models on code processing tasks. Most existing benchmarks are limited to a single file or even a single method, whereas Long Code Arena comprises six tasks that require project-level context: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. These tasks cover different aspects of code processing, and for each one the authors provide a manually verified test dataset, an evaluation suite, and open-source baseline solutions based on popular LLMs to facilitate future research. The paper points out that the limited context of existing benchmarks does not reflect real-world software development, and Long Code Arena aims to close this gap with tasks that operate at the scale of a whole project.

Evaluating and Aligning CodeLLMs on Human Preference

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

InfiCoder-Eval: Systematically Evaluating the Question-Answering Capabilities of Code Large Language Models

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

Marathon: A Race Through the Realm of Long Context with Large Language Models

Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion?

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

StackEval: Benchmarking LLMs in Coding Assistance

GraphArena: Benchmarking Large Language Models on Graph Computational Problems

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

LongCoder: A Long-Range Pre-trained Language Model for Code Completion