ChatGPT and Human Synergy in Black-Box Testing: A Comparative Analysis

Hiroyuki Kirinuki, Haruto Tanno
2024-01-25
Abstract: In recent years, large language models (LLMs), such as ChatGPT, have been pivotal in advancing various artificial intelligence applications, including natural language processing and software engineering. A promising yet underexplored area is utilizing LLMs in software testing, particularly in black-box testing. This paper explores the test cases devised by ChatGPT in comparison to those created by human participants. In this study, ChatGPT (GPT-4) and four participants each created black-box test cases for three applications based on specifications written by the authors. The goal was to evaluate the real-world applicability of the proposed test cases, identify potential shortcomings, and comprehend how ChatGPT could enhance human testing strategies. ChatGPT can generate test cases that generally match or slightly surpass those created by human participants in terms of test viewpoint coverage. Additionally, our experiments demonstrated that when ChatGPT cooperates with humans, it can cover considerably more test viewpoints than each can achieve alone, suggesting that collaboration between humans and ChatGPT may be more effective than human pairs working together. Nevertheless, we noticed that the test cases generated by ChatGPT have certain issues that require addressing before use.
Software Engineering
What problem does this paper attempt to address?
The paper aims to explore the application of large language models (such as ChatGPT) in black-box testing and to evaluate their practical effectiveness by comparing the generated test cases with those created by human participants. Specifically, the paper addresses the following issues:

1. **Effectiveness of Test Cases**: Evaluating the effectiveness of test cases generated by ChatGPT in practical applications.
2. **Coverage of Testing Perspectives**: Identifying the testing perspectives that ChatGPT tends to overlook when creating test cases and comparing them with the perspectives of human testers.
3. **Collaborative Benefits**: Exploring the potential for collaboration between ChatGPT and human testers, and whether such collaboration can more effectively cover testing perspectives than a team of human testers alone.

In the experiments, the researchers had ChatGPT (GPT-4 version) and four human participants generate black-box test cases for three applications (password strength checker, unit converter, and budget planner). The experimental results indicate that the test cases generated by ChatGPT are comparable to or slightly better than those of human participants in terms of coverage of testing perspectives. However, there are still some areas that need improvement, particularly in testing perspectives related to boundary values and maximum/minimum values. Additionally, the study found that collaboration between ChatGPT and human testers can significantly improve the coverage of testing perspectives.
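To make the notion of "coverage of testing perspectives" concrete, the following is a minimal sketch of how individual and combined coverage could be computed as set operations. It is not the authors' evaluation procedure: the viewpoint names, the tester labels, and the helpers `coverage` and `pair_coverage` are all hypothetical, and the numbers are illustrative rather than taken from the paper.

```python
from itertools import combinations

# Hypothetical reference list of test viewpoints for one application.
# These names are illustrative assumptions, not the paper's viewpoint list.
ALL_VIEWPOINTS = {
    "valid input",
    "empty input",
    "boundary value",
    "max/min value",
    "invalid characters",
    "error message",
}

# Hypothetical record of which viewpoints each tester's cases touched.
covered_by = {
    "chatgpt": {"valid input", "empty input", "invalid characters", "error message"},
    "human_a": {"valid input", "boundary value", "max/min value"},
    "human_b": {"valid input", "empty input", "boundary value"},
}


def coverage(covered: set, reference: set) -> float:
    """Fraction of the reference viewpoints present in `covered`."""
    return len(covered & reference) / len(reference)


def pair_coverage(a: str, b: str) -> float:
    """Coverage reached when two testers' viewpoints are pooled (set union)."""
    return coverage(covered_by[a] | covered_by[b], ALL_VIEWPOINTS)


if __name__ == "__main__":
    for name, viewpoints in covered_by.items():
        print(f"{name}: {coverage(viewpoints, ALL_VIEWPOINTS):.2f}")
    for a, b in combinations(covered_by, 2):
        print(f"{a} + {b}: {pair_coverage(a, b):.2f}")
```

In this made-up data, the ChatGPT set misses the boundary and maximum/minimum viewpoints while the human sets miss others, so pooling ChatGPT with a human covers more of the reference list than pooling the two humans. This mirrors the kind of complementarity the study reports, but the actual figures depend entirely on the paper's own data.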