ChatGPT and Human Synergy in Black-Box Testing: A Comparative Analysis

Hiroyuki Kirinuki, Haruto Tanno

2024-01-25

Abstract: In recent years, large language models (LLMs), such as ChatGPT, have been pivotal in advancing various artificial intelligence applications, including natural language processing and software engineering. A promising yet underexplored area is utilizing LLMs in software testing, particularly in black-box testing. This paper explores the test cases devised by ChatGPT in comparison to those created by human participants. In this study, ChatGPT (GPT-4) and four participants each created black-box test cases for three applications based on specifications written by the authors. The goal was to evaluate the real-world applicability of the proposed test cases, identify potential shortcomings, and comprehend how ChatGPT could enhance human testing strategies. ChatGPT can generate test cases that generally match or slightly surpass those created by human participants in terms of test viewpoint coverage. Additionally, our experiments demonstrated that when ChatGPT cooperates with humans, it can cover considerably more test viewpoints than each can achieve alone, suggesting that collaboration between humans and ChatGPT may be more effective than human pairs working together. Nevertheless, we noticed that the test cases generated by ChatGPT have certain issues that require addressing before use.

Software Engineering

What problem does this paper attempt to address?

The paper explores the use of large language models (such as ChatGPT) in black-box testing and evaluates their practical effectiveness by comparing the test cases they generate with those created by human participants. Specifically, the paper addresses the following issues:

1. **Effectiveness of Test Cases**: Evaluating how well test cases generated by ChatGPT hold up in practical use.
2. **Coverage of Test Viewpoints**: Identifying the test viewpoints that ChatGPT tends to overlook when creating test cases and comparing them with those covered by human testers.
3. **Collaborative Benefits**: Exploring the potential for collaboration between ChatGPT and human testers, and whether such collaboration covers test viewpoints more effectively than a pair of human testers.

In the experiments, ChatGPT (GPT-4) and four human participants each created black-box test cases for three applications (a password strength checker, a unit converter, and a budget planner). The results indicate that the test cases generated by ChatGPT are comparable to or slightly better than those of the human participants in terms of test viewpoint coverage. However, some areas still need improvement, particularly test viewpoints related to boundary values and maximum/minimum values. Additionally, the study found that collaboration between ChatGPT and human testers can cover considerably more test viewpoints than either can alone.
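To make "test viewpoint coverage" and the collaboration comparison concrete, here is a minimal sketch of one way such a metric could be computed. It assumes coverage is the fraction of a fixed viewpoint list touched by a tester's test cases, and that collaboration is modeled as the union of two testers' covered viewpoints. The viewpoint names, tester labels, and all values are made up for illustration; they are not data or results from the paper, and the paper's actual scoring procedure may differ.

```python
# Hypothetical viewpoint universe for one application (not from the paper).
ALL_VIEWPOINTS = {
    "valid input",
    "empty input",
    "boundary value",
    "maximum length",
    "invalid characters",
    "error message",
}

# Hypothetical sets of viewpoints covered by each tester's test cases.
# These values are invented purely to exercise the functions below.
covered = {
    "chatgpt": {"valid input", "empty input", "invalid characters", "error message"},
    "human_a": {"valid input", "boundary value", "maximum length"},
    "human_b": {"valid input", "empty input", "boundary value"},
}


def coverage(viewpoints, universe=ALL_VIEWPOINTS):
    """Fraction of the viewpoint universe that the given set covers."""
    return len(viewpoints & universe) / len(universe)


def pair_coverage(first, second):
    """Coverage of two testers combined, modeled as the union of their sets."""
    return coverage(covered[first] | covered[second])


if __name__ == "__main__":
    for name, viewpoints in covered.items():
        print(f"{name}: {coverage(viewpoints):.2f}")
    print(f"chatgpt + human_a: {pair_coverage('chatgpt', 'human_a'):.2f}")
    print(f"human_a + human_b: {pair_coverage('human_a', 'human_b'):.2f}")
```

Modeling collaboration as a set union is the simplest possible choice: a pair only gains coverage where the two testers' blind spots differ, which is why a tester who misses boundary values can be complemented by one who covers them.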

Exploring the Capability of ChatGPT in Test Generation

Can ChatGPT advance software testing intelligence? An experience report on metamorphic testing

Is ChatGPT the Ultimate Programming Assistant -- How far is it?

Evaluating and Improving ChatGPT for Unit Test Generation

The advantages and limitations of using ChatGPT to enhance technological research

No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation

Unveiling Assumptions: Exploring the Decisions of AI Chatbots and Human Testers

Fight Fire with Fire: How Much Can We Trust ChatGPT on Source Code-Related Tasks?

Can ChatGPT Assess Human Personalities? A General Evaluation Framework

Finding Failure-Inducing Test Cases with ChatGPT

Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting

System Test Case Design from Requirements Specifications: Insights and Challenges of Using ChatGPT

Unit Test Generation using Generative AI : A Comparative Performance Analysis of Autogeneration Tools

"Will I be replaced?" Assessing ChatGPT's effect on software development and programmer perceptions of AI tools

Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity

ChatGPTest: opportunities and cautionary tales of utilizing AI for questionnaire pretesting

ChatGPT or Human? Detect and Explain. Explaining Decisions of Machine Learning Model for Detecting Short ChatGPT-generated Text

ChatGPT: A Study on its Utility for Ubiquitous Software Engineering Tasks

Assessing the Promise and Pitfalls of ChatGPT for Automated Code Generation

Can ChatGPT Play the Role of a Teaching Assistant in an Introductory Programming Course?