Debugging with Open-Source Large Language Models: An Evaluation

Yacine Majdoub, Eya Ben Charrada
DOI: https://doi.org/10.1145/3674805.3690758
2024-09-05
Abstract: Large language models have shown good potential in supporting software development tasks. This is why more and more developers turn to LLMs (e.g. ChatGPT) to support them in fixing their buggy code. While this can save time and effort, many companies prohibit it due to strict code sharing policies. To address this, companies can run open-source LLMs locally. But until now there is not much research evaluating the performance of open-source large language models in debugging. This work is a preliminary evaluation of the capabilities of open-source LLMs in fixing buggy code. The evaluation covers five open-source large language models and uses the benchmark DebugBench which includes more than 4000 buggy code instances written in Python, Java and C++. Open-source LLMs achieved scores ranging from 43.9% to 66.6% with DeepSeek-Coder achieving the best score for all three programming languages.
Software Engineering
What problem does this paper attempt to address?
This paper evaluates how well Open-Source Large Language Models (OSLLMs) debug code. It poses two research questions:

1. **Research Question 1 (RQ1)**: How do open-source large language models perform in debugging tasks? To answer this question, the authors evaluated five open-source large language models on the DebugBench benchmark, which contains over 4,000 instances of buggy code written in Python, Java, and C++.
2. **Research Question 2 (RQ2)**: How does the ability of open-source large language models in code generation affect their performance in debugging tasks? The authors compared the scores of these models on debugging tasks with their scores on the HumanEval benchmark.

The results indicate that although some open-source models do not perform as well as the most advanced closed-source models (such as GPT-4), certain smaller open-source models (such as DeepSeek-Coder) still achieved relatively good results. Additionally, with the exception of DeepSeek-Coder, models that performed better on HumanEval also scored higher on debugging tasks. This suggests that while models with stronger code generation capabilities may perform better in debugging tasks, the relationship between the two is not absolute.
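The RQ2 comparison can be illustrated with a rank correlation between two sets of benchmark scores. The sketch below is only an illustration: the summary does not say which statistic the authors used, so Spearman's rank correlation is an assumption here. The model names (`model_a` to `model_e`) and all score values are made-up placeholders, not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical placeholder scores: (code-generation score, debugging score).
# These numbers are invented for illustration and are NOT taken from the paper.
scores = {
    "model_a": (30.0, 45.0),
    "model_b": (50.0, 55.0),
    "model_c": (60.0, 58.0),
    "model_d": (70.0, 62.0),
    "model_e": (40.0, 66.0),  # outlier: modest code generation, best debugging
}

codegen = [pair[0] for pair in scores.values()]
debug = [pair[1] for pair in scores.values()]

rho_all = spearman(codegen, debug)
rho_without_outlier = spearman(codegen[:4], debug[:4])

print(f"all models:      rho = {rho_all:.2f}")
print(f"without model_e: rho = {rho_without_outlier:.2f}")
```

With these placeholder values, the four non-outlier models are ordered identically on both benchmarks, while adding the outlier lowers the correlation. This mirrors the shape of the finding described above (a general trend with one exception), not its actual magnitude.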