Computer User Interface Understanding. A New Dataset and a Learning Framework

Andrés Muñoz, Daniel Borrajo

2024-08-28

Abstract: User Interface (UI) understanding has been an increasingly popular topic over the last few years. So far, there has been a vast focus solely on web and mobile applications. In this paper, we introduce the harder task of computer UI understanding. With the goal of enabling research in this field, we have generated a dataset with a set of videos where a user is performing a sequence of actions and each image shows the desktop contents at that time point. We also present a framework that is composed of a synthetic sample generation pipeline to augment the dataset with relevant characteristics, and a contrastive learning method to classify images in the videos. We take advantage of the natural conditional, tree-like, relationship of the images' characteristics to regularize the learning of the representations by dealing with multiple partial tasks simultaneously. Experimental results show that the proposed framework outperforms previously proposed hierarchical multi-label contrastive losses in fine-grain UI classification.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses the problem of understanding computer user interfaces (UIs) and proposes a new dataset and a learning framework for it. Specifically:

1. **Task Definition**: The authors define a new UI understanding task, which involves understanding and representing the computer screen as a state. This includes identifying the applications currently running on the screen, the application views, and the interaction context (such as selected text).
2. **Dataset**: To support research on this task, the authors created a new dataset called DataVisualWorkflow, which contains video clips of users performing sequences of operations on a computer. These videos are used to train models to understand the state of the computer UI.
3. **Learning Framework**: The authors propose a semi-supervised learning framework named UI Multi-task Contrastive Learning (UIMTCon). It consists of two modules: one that generates synthetic samples to augment the dataset, and an embedding network that extracts features from images.
4. **Contributions**:
   - Introduced a new UI understanding task that treats the computer screen as a state to be understood.
   - Proposed a new framework for generating synthetic samples and learning representations of noisy inputs.
   - Created a new dataset to support research on unsupervised and semi-supervised learning methods.

Through these efforts, the paper aims to advance the large-scale development of automated systems, particularly systems that observe how humans perform tasks on computers in enterprise environments and attempt to automate workers' workflows.
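To make the idea of contrastive learning over tree-like labels more concrete, here is a minimal sketch in plain Python. It is not the paper's UIMTCon loss, whose exact form is not given in this summary. It assumes each screenshot carries a hypothetical label path such as (application, view), treats each tree level as one partial task, and sums a standard supervised contrastive loss over the levels. The function names, the `weights` parameter, and the temperature value are all illustrative assumptions.

```python
import math


def sup_con_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss for one labeling of a batch.

    Samples sharing a label are positives; all other samples are negatives.
    Anchors without any positive in the batch are skipped.
    """
    # L2-normalize so that the dot product is the cosine similarity.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarity of the anchor to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def multi_task_loss(embeddings, label_paths, weights=None, temperature=0.1):
    """Weighted sum of per-level contrastive losses (illustrative assumption).

    At level k, two samples are positives only if their label paths agree on
    every level up to and including k, which encodes the tree structure.
    """
    depth = len(label_paths[0])
    weights = weights or [1.0] * depth
    loss = 0.0
    for level, weight in enumerate(weights):
        prefixes = [tuple(path[: level + 1]) for path in label_paths]
        loss += weight * sup_con_loss(embeddings, prefixes, temperature)
    return loss


# Hypothetical label paths: (application, view).
paths = [("browser", "inbox"), ("browser", "inbox"),
         ("editor", "document"), ("editor", "document")]
clustered = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(multi_task_loss(clustered, paths))
```

Embeddings that place screenshots with the same label path close together receive a lower loss than embeddings that mix them, which is the pressure a contrastive objective applies during training. Using label prefixes per level is one simple way to respect the conditional structure of the labels; other weightings or level definitions are equally plausible.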

UIBert: Learning Generic Multimodal Representations for UI Understanding

Towards Better Semantic Understanding of Mobile Interfaces

Falcon-UI: Understanding GUI Before Following User Instructions

Tell Me What's Next: Textual Foresight for Generic UI Representations

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Harnessing Webpage UIs for Text-Rich Visual Understanding

ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations

Lexi: Self-Supervised Learning of the UI Language

UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity

VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs

UI Layers Merger: Merging UI Layers Via Visual Learning and Boundary Prior

UIClip: A Data-driven Model for Assessing User Interface Design

AutoGameUI: Constructing High-Fidelity Game UIs via Multimodal Learning and Interactive Web-Based Tool

GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents

UEyes: An Eye-Tracking Dataset across User Interface Types

MUD: Towards a Large-Scale and Noise-Filtered UI Dataset for Modern Style UI Modeling

UGIF: UI Grounded Instruction Following

Visual grounding for desktop graphical user interfaces

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Understanding Mobile GUI: from Pixel-Words to Screen-Sentences