Abstract: Ensuring usability is crucial for the success of mobile apps. Usability issues can compromise user experience and negatively impact the perceived app quality. This paper presents UX-LLM, a novel tool powered by a Large Vision-Language Model that predicts usability issues in iOS apps. To evaluate the performance of UX-LLM, we predicted usability issues in two open-source apps of medium complexity and asked usability experts to assess the predictions. We also performed traditional usability testing and expert review for both apps and compared the results to those of UX-LLM. UX-LLM demonstrated precision ranging from 0.61 to 0.66 and recall between 0.35 and 0.38, indicating its ability to identify valid usability issues while failing to capture the majority of issues. Finally, we conducted a focus group with the development team of a capstone project building a transit app for visually impaired persons. The focus group expressed positive perceptions of UX-LLM, as it identified previously unknown usability issues in their app. However, they also raised concerns about its integration into the development workflow and suggested potential improvements. Our results show that UX-LLM cannot fully replace traditional usability evaluation methods but serves as a valuable supplement, particularly for small teams with limited resources, and can identify issues in less common user paths thanks to its ability to inspect the source code.

### What problem does this paper attempt to solve?

This paper explores whether generative artificial intelligence (GenAI) can replace traditional usability testing, especially in mobile application development. The authors developed a new tool named UX-LLM, which uses a large vision-language model to predict usability problems in iOS applications. By comparing UX-LLM with traditional usability evaluation methods, the researchers aim to answer three main questions:

1. **RQ1: How accurate is UX-LLM in predicting usability problems?**
   - The researchers measure its performance by evaluating the precision and recall of UX-LLM in identifying actual usability problems.
2. **RQ2: How do the problems predicted by UX-LLM compare with those identified by traditional usability evaluation methods? To what extent can it replace these traditional methods?**
   - By comparing the results of UX-LLM, expert reviews, and user tests, the researchers aim to understand the value and limitations of UX-LLM as an auxiliary tool.
3. **RQ3: How does the application development team view the supporting role of UX-LLM in the development process?**
   - Through a focus group discussion with the team of an ongoing application development project, the researchers aim to understand the team's views on UX-LLM, the challenges they may encounter in actual work, and their suggestions for improvement.

### Main findings

- **Accuracy**: On two open-source applications, UX-LLM identifies valid usability problems but fails to capture most of them. Its precision ranges from 0.61 to 0.66, and its recall from 0.35 to 0.38.
- **Supplement rather than replacement**: Although UX-LLM can identify valid usability problems, it cannot completely replace traditional usability evaluation methods. Instead, it serves as a valuable supplementary tool, especially for small development teams with limited resources, helping them identify problems in uncommon user paths.
- **Development team feedback**: The focus group viewed UX-LLM positively because it identified usability problems that had previously gone unnoticed. However, the team also raised concerns about how to integrate it into the existing development workflow and suggested some potential improvements.

Overall, this paper demonstrates the potential of GenAI for automating mobile application usability evaluation, while emphasizing its role as an auxiliary tool rather than a complete replacement for traditional methods.
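To make the precision and recall figures easier to interpret, the following is a minimal sketch of how the two metrics are computed from a set of predicted issues and a reference set of known issues. The function name `precision_recall` and the issue labels are hypothetical and purely illustrative; they are not the paper's data, and the paper's exact procedure for building the reference set may differ.

```python
from __future__ import annotations


def precision_recall(predicted: set[str], actual: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for predicted issues against known issues.

    precision = valid predictions / all predictions
    recall    = valid predictions / all known issues
    """
    true_positives = predicted & actual
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical issue labels, made up for illustration only.
predicted_issues = {
    "low-contrast-label",
    "missing-back-button",
    "tiny-tap-target",
    "ambiguous-icon",
    "no-loading-indicator",
}
known_issues = {
    "low-contrast-label",
    "missing-back-button",
    "tiny-tap-target",
    "confusing-onboarding",
    "no-undo",
    "hidden-search",
    "slow-feedback",
    "unclear-error",
}

precision, recall = precision_recall(predicted_issues, known_issues)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, three of the five predictions are valid while five of the eight known issues are missed. This is the same pattern the summary describes: most reported issues are valid, yet the majority of real issues go undetected.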

Does GenAI Make Usability Testing Obsolete?
