Guardian: A Runtime Framework for LLM-based UI Exploration

Dezhi Ran, Hao Wang, Zihe Song, Mengzhou Wu, Yuan Cao, Ying Zhang, Wei Yang, Tao Xie
DOI: https://doi.org/10.1145/3650212.3680334
2024-01-01
Abstract: Tests for feature-based UI testing have been indispensable for ensuring the quality of mobile applications (apps for short). The high manual labor costs of creating such tests have led to strong interest in automated feature-based UI testing, where an approach, given only a high-level description of the test objective, automatically explores the App under Test (AUT) to find a correct sequence of UI events that achieves that objective. Given that the task of automated feature-based UI testing resembles conventional AI planning problems, large language models (LLMs), known for their effectiveness in AI planning, could be ideal for this task. However, our study reveals that LLMs struggle with following specific instructions for UI testing and with replanning based on new information. This limitation reduces the effectiveness of LLM-driven solutions for automated feature-based UI testing, despite the use of advanced prompting techniques. Toward addressing this limitation, we propose Guardian, a runtime system framework that improves the effectiveness of automated feature-based UI testing by offloading computational tasks from LLMs with two major strategies. First, Guardian refines the UI action space that the LLM can plan over, enforcing the instruction following of the LLM by construction. Second, Guardian deliberately checks whether the gradually enriched information invalidates previous planning by the LLM. If it does, Guardian removes the invalidated UI actions from the UI action space that the LLM can plan over, restores the state of the AUT to the state before the execution of the invalidated UI actions, and prompts the LLM to re-plan with the new UI action space. We instantiate Guardian with ChatGPT and construct a benchmark named FestiVal with 58 tasks from 23 highly popular apps. Evaluation results on FestiVal show that Guardian achieves 48.3
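The two strategies described in the abstract can be illustrated with a small sketch. This is not Guardian's implementation: the app is modeled as a toy transition table, the LLM is replaced by an arbitrary `planner` callable, and every name here (`explore`, `blocked`, `Transitions`) is hypothetical. It only shows the general shape of the loop, under those assumptions: the planner may choose solely from a filtered action list, and when new information (here, a dead end) invalidates an earlier action, that action is pruned, the earlier state is restored, and planning resumes.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

State = str
Action = str
# Toy stand-in for the AUT: (state, action) -> next state. Illustrative only.
Transitions = Dict[Tuple[State, Action], State]
# Stand-in for the LLM: picks one action from the list it is given.
Planner = Callable[[State, List[Action]], Action]


def explore(
    transitions: Transitions,
    start: State,
    goal: State,
    planner: Planner,
    max_steps: int = 20,
) -> Optional[List[Action]]:
    """Return an action sequence that reaches `goal`, or None if none is found."""
    blocked: Dict[State, Set[Action]] = {}    # pruned actions, per state
    history: List[Tuple[State, Action]] = []  # (state before action, action)
    state = start
    for _ in range(max_steps):
        if state == goal:
            break
        # Strategy 1: the planner only ever sees valid, non-pruned actions,
        # so it cannot choose anything outside the refined action space.
        allowed = sorted(
            a for (s, a) in transitions
            if s == state and a not in blocked.get(state, set())
        )
        if not allowed:
            # Strategy 2: a dead end invalidates the last planned action.
            if not history:
                return None  # nothing left to undo or try
            prev_state, prev_action = history.pop()
            blocked.setdefault(prev_state, set()).add(prev_action)  # prune
            state = prev_state                                      # restore
            continue                                                # re-plan
        choice = planner(state, allowed)
        if choice not in allowed:
            raise ValueError(f"planner chose {choice!r}, not in {allowed}")
        history.append((state, choice))
        state = transitions[(state, choice)]
    return [a for _, a in history] if state == goal else None
```

In this sketch the restore step is a simple variable assignment; for a real app, returning to an earlier state would require something heavier, such as replaying the recorded action history from a fresh launch.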
What problem does this paper attempt to address?