Abstract: Experimental evaluations of software engineering innovations, e.g., tools and processes, often include human-subject studies as a component of a multi-pronged strategy to obtain greater generalizability of the findings. However, human-subject studies in our field are challenging, due to the cost and difficulty of finding and employing suitable subjects, ideally, professional programmers with varying degrees of experience. Meanwhile, large language models (LLMs) have recently started to demonstrate human-level performance in several areas. This paper explores the possibility of substituting costly human subjects with much cheaper LLM queries in evaluations of code and code-related artifacts. We study this idea by applying six state-of-the-art LLMs to ten annotation tasks from five datasets created by prior work, such as judging the accuracy of a natural language summary of a method or deciding whether a code change fixes a static analysis warning. Our results show that replacing some human annotation effort with LLMs can produce inter-rater agreements equal or close to human-rater agreement. To help decide when and how to use LLMs in human-subject studies, we propose model-model agreement as a predictor of whether a given task is suitable for LLMs at all, and model confidence as a means to select specific samples where LLMs can safely replace human annotators. Overall, our work is the first step toward mixed human-LLM evaluations in software engineering.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper explores whether large language models (LLMs) can replace some or all of the human effort in manual annotation tasks in software engineering. Specifically, it addresses the following problems:

1. **Cost problem**: In software engineering research, evaluating tools and techniques usually requires human-subject studies, which are both expensive and time-consuming. Recruiting professional programmers is costly, and results from lower-cost participants such as students may not generalize. The paper therefore explores whether LLMs can reduce these costs.
2. **Evaluation consistency problem**: A key issue in human-subject studies is inter-rater reliability. Different evaluators may judge the same item differently, leading to inconsistent results. By comparing LLM-human agreement with the agreement among human evaluators, the paper explores whether LLMs can reach agreement levels comparable to those of human evaluators.
3. **Task suitability problem**: Not all software engineering tasks are suitable for LLMs. By analyzing model-model agreement and human-model agreement across tasks, the paper proposes a way to determine which tasks are suitable for LLMs.

### Main contributions

1. **First study**: This is the first systematic study of the potential of LLMs in manual annotation tasks in software engineering.
2. **Methodology**: A methodology is proposed to evaluate the performance of LLMs on different tasks and compare it with that of human evaluators.
3. **Task selection method**: A method based on model-model agreement is proposed to decide which tasks are suitable for LLMs.
4. **Sample selection method**: A method based on model confidence is proposed to select specific samples for which LLMs can safely replace human evaluators.
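
The task selection idea can be illustrated with a minimal sketch. It computes the mean pairwise agreement among several models' labels for one task, which could then be compared against a cutoff. The summary does not name the agreement statistic used in the paper, so Cohen's kappa is an assumption here, and the function names `cohen_kappa` and `mean_pairwise_agreement` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equally long sequences of categorical labels.

    Assumption: kappa stands in for whatever agreement statistic the paper uses.
    """
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_agreement(ratings):
    """Average kappa over all rater pairs; `ratings` maps rater name -> labels."""
    pairs = list(combinations(ratings.values(), 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Calling `mean_pairwise_agreement` on a dictionary of per-model label lists gives one number per task. Any cutoff for calling a task "suitable" would have to be chosen by the user; the summary does not state one.
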
### Experimental design

The paper selects five datasets covering ten annotation tasks, including code summarization, name-value inconsistency, causality, semantic similarity, and static analysis warnings. The experiments use six state-of-the-art LLMs, including closed-source models (such as GPT-4, Claude-3.5-Sonnet, and Gemini-1.5-Pro) and open-source models (such as Llama3 and Mixtral).

### Experimental results

1. **Agreement level**:
   - On some tasks, model-human agreement is equal or close to human-human agreement. For example, in the code summarization accuracy task, human-human agreement is 0.38, while model-human agreement is 0.48.
   - On other tasks, LLMs perform poorly. For example, in the static analysis warning task, human-human agreement is 0.80, while model-human agreement is only 0.15.
2. **Model-model agreement**:
   - There is a strong positive correlation between model-model agreement and model-human agreement (Spearman correlation coefficient of 0.65), which indicates that model-model agreement can serve as an indicator of whether a task is suitable for LLMs.
3. **Sample selection**:
   - Model confidence (output probability) can help select samples for which LLMs can safely replace human evaluators. For example, in the code summarization similarity task, delegating 50% of the human evaluation work to LLMs (selected based on model confidence) does not significantly change the overall agreement.

### Conclusion

The results show that LLMs can effectively replace human evaluators in some software engineering annotation tasks, thereby reducing cost and time. For other tasks, however, LLMs still perform worse than humans. Model-model agreement and model confidence make it possible to decide more reliably when and how to use LLMs.
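
The confidence-based sample selection described above can be sketched as follows. The most confident fraction of samples keeps the model's label, and the remainder keeps the human label. This is an illustration under stated assumptions, not the paper's implementation: the record fields `id`, `model_label`, and `confidence`, as well as both function names, are hypothetical.

```python
def split_by_confidence(samples, fraction):
    """Split samples into (delegated, kept) by descending model confidence.

    `samples` is a list of dicts with hypothetical keys "id", "model_label",
    and "confidence" (e.g. the output probability of the chosen label).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ranked = sorted(samples, key=lambda s: s["confidence"], reverse=True)
    cut = int(len(ranked) * fraction)  # number of samples handed to the model
    return ranked[:cut], ranked[cut:]


def merge_labels(samples, human_labels, fraction):
    """Use model labels for the most confident samples, human labels otherwise."""
    delegated, kept = split_by_confidence(samples, fraction)
    merged = {s["id"]: s["model_label"] for s in delegated}
    merged.update({s["id"]: human_labels[s["id"]] for s in kept})
    return merged
```

With `fraction=0.5`, half of the samples take the model's label, mirroring the 50% delegation example above. Whether overall agreement is preserved would still have to be checked on the merged labels for each task.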

Can LLMs Replace Manual Annotation of Software Engineering Artifacts?
