Debugging with Open-Source Large Language Models: An Evaluation

Yacine Majdoub, Eya Ben Charrada

DOI: https://doi.org/10.1145/3674805.3690758

2024-09-05

Abstract: Large language models have shown good potential in supporting software development tasks. This is why more and more developers turn to LLMs (e.g. ChatGPT) to support them in fixing their buggy code. While this can save time and effort, many companies prohibit it due to strict code sharing policies. To address this, companies can run open-source LLMs locally. But until now there is not much research evaluating the performance of open-source large language models in debugging. This work is a preliminary evaluation of the capabilities of open-source LLMs in fixing buggy code. The evaluation covers five open-source large language models and uses the benchmark DebugBench which includes more than 4000 buggy code instances written in Python, Java and C++. Open-source LLMs achieved scores ranging from 43.9% to 66.6% with DeepSeek-Coder achieving the best score for all three programming languages.

Software Engineering

What problem does this paper attempt to address?

This paper evaluates how well open-source large language models (LLMs) debug code. It addresses two research questions:

1. **Research Question 1 (RQ1)**: How do open-source large language models perform in debugging tasks? To answer this question, the authors evaluated five open-source large language models on DebugBench, a benchmark containing over 4,000 instances of buggy code written in Python, Java, and C++.
2. **Research Question 2 (RQ2)**: How does the ability of open-source large language models in code generation affect their performance in debugging tasks? The authors compared the models' scores on debugging tasks with their scores on the HumanEval benchmark.

The results indicate that although some open-source models do not perform as well as the most advanced closed-source models (such as GPT-4), certain smaller open-source models (such as DeepSeek-Coder) still achieved relatively good results. Additionally, with the exception of DeepSeek-Coder, models that performed better on HumanEval also scored higher on debugging tasks. This suggests that stronger code generation capability tends to go along with better debugging performance, but the relationship is not absolute.
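The per-language and overall scores mentioned above are percentages of buggy instances that a model fixed successfully. A minimal sketch of that aggregation is shown below. It assumes each evaluation outcome is recorded as a `(language, passed)` pair. The function name `pass_rates`, the record layout, and the toy data are all hypothetical, for illustration only, and are not taken from the paper or from DebugBench.

```python
from collections import defaultdict


def pass_rates(results):
    """Return {language: percent passed}, plus an "overall" entry.

    `results` is an iterable of (language, passed) pairs, where `passed`
    is True when the model's fix was accepted for that buggy instance.
    This record layout is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    passed_counts = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        passed_counts[language] += int(passed)

    # Per-language pass rate as a percentage.
    rates = {
        lang: 100.0 * passed_counts[lang] / totals[lang] for lang in totals
    }

    # Overall pass rate across every instance, regardless of language.
    n = sum(totals.values())
    rates["overall"] = 100.0 * sum(passed_counts.values()) / n if n else 0.0
    return rates


# Made-up outcomes, for illustration only (not data from the paper).
toy_results = [
    ("python", True), ("python", False),
    ("java", True), ("java", True),
    ("cpp", False), ("cpp", True), ("cpp", False), ("cpp", False),
]
rates = pass_rates(toy_results)
```

Note that the overall rate is computed over all instances, so a language with more instances weighs more heavily than one with fewer. Whether the paper reports its scores this way is not stated in the summary above.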

DebugBench: Evaluating Debugging Capability of Large Language Models

MdEval: Massively Multilingual Code Debugging

Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation

A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair

Evaluating Language Models for Generating and Judging Programming Feedback

LDB: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement

Effective Large Language Model Debugging with Best-first Tree Search

Fixing Code Generation Errors for Large Language Models

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge

Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction

The Emergence of Large Language Models in Static Analysis: A First Look through Micro-Benchmarks

Investigating large language models capabilities for automatic code repair in Python

CodeJudge: Evaluating Code Generation with Large Language Models

Deploying Open-Source Large Language Models: A performance Analysis

On the Evaluation of Large Language Models in Unit Test Generation

Teaching Large Language Models to Self-Debug