Abstract: Experimental evaluations of software engineering innovations, e.g., tools and processes, often include human-subject studies as a component of a multi-pronged strategy to obtain greater generalizability of the findings. However, human-subject studies in our field are challenging, due to the cost and difficulty of finding and employing suitable subjects, ideally, professional programmers with varying degrees of experience. Meanwhile, large language models (LLMs) have recently started to demonstrate human-level performance in several areas. This paper explores the possibility of substituting costly human subjects with much cheaper LLM queries in evaluations of code and code-related artifacts. We study this idea by applying six state-of-the-art LLMs to ten annotation tasks from five datasets created by prior work, such as judging the accuracy of a natural language summary of a method or deciding whether a code change fixes a static analysis warning. Our results show that replacing some human annotation effort with LLMs can produce inter-rater agreements equal or close to human-rater agreement. To help decide when and how to use LLMs in human-subject studies, we propose model-model agreement as a predictor of whether a given task is suitable for LLMs at all, and model confidence as a means to select specific samples where LLMs can safely replace human annotators. Overall, our work is the first step toward mixed human-LLM evaluations in software engineering.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper explores whether large language models (LLMs) can replace some or all of the human effort in manual annotation tasks in software engineering. Specifically, it addresses three problems:
1. **Cost problem**: Evaluating software engineering tools and techniques usually requires human-subject studies, which are expensive and time-consuming. Recruiting professional programmers is costly, and lower-cost participants such as students may not yield generalizable results. The paper therefore explores whether LLMs can reduce these costs.
2. **Evaluation consistency problem**: A key issue in human-subject studies is inter-rater reliability: different evaluators may judge the same item differently, which leads to inconsistent results. By comparing model-human agreement against human-human agreement, the paper examines whether LLMs can be as consistent with human evaluators as human evaluators are with each other.
3. **Task suitability problem**: Not all software engineering annotation tasks are suitable for LLMs. By analyzing model-model agreement and human-model agreement across tasks, the paper proposes a way to determine which tasks are suitable for LLMs.
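To make the notion of inter-rater agreement concrete, here is a minimal sketch of a chance-corrected agreement score between two raters. It uses Cohen's kappa purely as an illustration; the summary above does not say which agreement statistic the paper uses, so the choice of metric, the function name `cohen_kappa`, and the toy labels are all assumptions.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over the same items.

    Illustrative only: the paper's actual agreement statistic may differ.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(ratings_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters always use the same single label.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six code summaries (not data from the paper).
human_labels = ["accurate", "accurate", "inaccurate", "accurate", "inaccurate", "inaccurate"]
model_labels = ["accurate", "accurate", "inaccurate", "inaccurate", "inaccurate", "inaccurate"]
print(cohen_kappa(human_labels, model_labels))
```

The same function can be applied to a pair of human raters or to a human and a model, which is what allows human-human and model-human agreement to be compared on one scale.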
### Main contributions
1. **First study**: This is the first systematic study of the potential of LLMs for manual annotation tasks in software engineering.
2. **Methodology**: The paper proposes a methodology for evaluating LLM performance on different annotation tasks and comparing it with that of human evaluators.
3. **Task selection method**: The paper proposes using model-model agreement to decide which tasks are suitable for LLMs.
4. **Sample selection method**: The paper proposes using model confidence to select the specific samples on which LLMs can safely replace human evaluators.
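The sample selection idea in item 4 can be sketched as follows: rank samples by model confidence, hand the most confident fraction to the model, and keep human labels for the rest. This is a minimal illustration under assumptions, not the paper's implementation; the function name `delegate_by_confidence`, the dictionary keys, and the toy data are hypothetical.

```python
def delegate_by_confidence(samples, fraction):
    """Return one label per sample, using the model's label for the most
    confident `fraction` of samples and the human label for the rest.

    Each sample is a dict with the (assumed) keys 'human_label',
    'model_label', and 'model_confidence'.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n_delegate = int(len(samples) * fraction)
    # Indices sorted from most to least confident model prediction.
    ranked = sorted(
        range(len(samples)),
        key=lambda i: samples[i]["model_confidence"],
        reverse=True,
    )
    delegated = set(ranked[:n_delegate])
    return [
        s["model_label"] if i in delegated else s["human_label"]
        for i, s in enumerate(samples)
    ]


# Hypothetical samples (not data from the paper).
samples = [
    {"human_label": 1, "model_label": 0, "model_confidence": 0.90},
    {"human_label": 0, "model_label": 1, "model_confidence": 0.60},
    {"human_label": 1, "model_label": 1, "model_confidence": 0.95},
    {"human_label": 0, "model_label": 1, "model_confidence": 0.55},
]
mixed_labels = delegate_by_confidence(samples, fraction=0.5)
print(mixed_labels)
```

The mixed labels can then be compared against the remaining human raters to check whether overall agreement changes when part of the work is delegated.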
### Experimental design
The paper uses five datasets covering ten annotation tasks, which involve code summarization, name-value inconsistency, causality, semantic similarity, and static analysis warnings. The experiments use six state-of-the-art LLMs, including closed-source models (such as GPT-4, Claude-3.5-Sonnet, and Gemini-1.5-Pro) and open-source models (such as Llama3 and Mixtral).
### Experimental results
1. **Agreement level**:
- On some tasks, model-human agreement is equal or close to human-human agreement. For example, in the code summarization accuracy task, human-human agreement is 0.38, while model-human agreement is 0.48.
- On other tasks, LLMs perform poorly. For example, in the static analysis warning task, human-human agreement is 0.80, while model-human agreement is only 0.15.
2. **Model-model agreement**:
- Model-model agreement is strongly and positively correlated with model-human agreement (Spearman correlation coefficient of 0.65), which suggests that model-model agreement can serve as an indicator of whether a task is suitable for LLMs.
3. **Sample selection**:
- Model confidence (output probability) can help select the samples on which LLMs can safely replace human evaluators. For example, in the code summarization similarity task, delegating 50% of the human evaluation work to LLMs (with samples selected by model confidence) does not significantly change overall agreement.
### Conclusion
The results show that LLMs can effectively replace human evaluators in some software engineering annotation tasks, reducing cost and time. For other tasks, however, LLM performance still falls short of human performance. Model-model agreement and model confidence make it possible to decide more reliably when and how to use LLMs.