Computer User Interface Understanding. A New Dataset and a Learning Framework

Andrés Muñoz, Daniel Borrajo
2024-08-28
Abstract: User Interface (UI) understanding has been an increasingly popular topic over the last few years. So far, there has been a vast focus solely on web and mobile applications. In this paper, we introduce the harder task of computer UI understanding. With the goal of enabling research in this field, we have generated a dataset with a set of videos where a user is performing a sequence of actions and each image shows the desktop contents at that time point. We also present a framework that is composed of a synthetic sample generation pipeline to augment the dataset with relevant characteristics, and a contrastive learning method to classify images in the videos. We take advantage of the natural conditional, tree-like, relationship of the images' characteristics to regularize the learning of the representations by dealing with multiple partial tasks simultaneously. Experimental results show that the proposed framework outperforms previously proposed hierarchical multi-label contrastive losses in fine-grain UI classification.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the problem of understanding computer (desktop) user interfaces (UIs) and proposes a new dataset and a learning framework for it. Specifically:

1. **Task Definition**:
   - The authors define a new UI understanding task: representing the computer screen as a state. This includes identifying the applications currently running on the screen, the application views, and the interaction context (such as selected text).
2. **Dataset**:
   - To support research on this task, the authors created a new dataset called DataVisualWorkflow. It contains video clips of users performing sequences of operations on a computer, and these videos are used to train models to understand the state of the computer UI.
3. **Learning Framework**:
   - The authors propose a semi-supervised learning framework named UI Multi-task Contrastive Learning (UIMTCon). It consists of two modules: one that generates synthetic samples to augment the dataset, and an embedding network that extracts features from images.
4. **Contributions**:
   - Introduced a new UI understanding task that treats the computer screen as a state to be understood.
   - Proposed a new framework for generating synthetic samples and learning representations of noisy inputs.
   - Created a new dataset to support research on unsupervised and semi-supervised learning methods.

Through these efforts, the paper aims to support the development of automation systems at scale, particularly systems that observe how people perform tasks on computers in enterprise environments and attempt to automate those workflows.
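
The abstract describes regularizing the representations by handling several partial tasks at once over tree-like image characteristics. The sketch below illustrates that general idea only; it is not the paper's UIMTCon loss, which this summary does not specify. It assumes a three-level label `(application, view, context)` and uses a standard supervised contrastive loss as a stand-in, computed once per tree depth. The names `Label`, `supcon_loss`, `multitask_loss`, the weights, and the temperature are all hypothetical.

```python
import math
from typing import Hashable, List, Sequence, Tuple

# Hypothetical tree-structured label: (application, view, interaction context).
Label = Tuple[str, str, str]


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supcon_loss(embeddings: Sequence[Sequence[float]],
                keys: Sequence[Hashable],
                temperature: float = 0.1) -> float:
    """Supervised contrastive loss: samples with equal keys are positives.

    Anchors without any positive in the batch are skipped.
    """
    z = [_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and keys[j] == keys[i]]
        if not positives:
            continue
        # Scaled cosine similarity between the anchor and every other sample.
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits.values())
        log_denom = top + math.log(sum(math.exp(v - top) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def multitask_loss(embeddings: Sequence[Sequence[float]],
                   labels: Sequence[Label],
                   weights: Sequence[float] = (1.0, 1.0, 1.0),
                   temperature: float = 0.1) -> float:
    """Weighted sum of one contrastive loss per depth of the label tree.

    At depth d, two samples are positives if their labels share the first d
    levels, so deeper tasks are conditioned on agreement at shallower ones.
    """
    loss = 0.0
    for depth, weight in enumerate(weights, start=1):
        keys = [label[:depth] for label in labels]
        loss += weight * supcon_loss(embeddings, keys, temperature)
    return loss


if __name__ == "__main__":
    # Toy batch: two browser screens and two editor screens (made-up labels).
    labels = [("browser", "inbox", "none"),
              ("browser", "inbox", "text_selected"),
              ("editor", "document", "none"),
              ("editor", "document", "text_selected")]
    embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(multitask_loss(embeddings, labels))
```

In this toy setup, embeddings that cluster by application and view yield a lower loss than embeddings that mix them, which is the behavior a hierarchical contrastive objective is meant to encourage. The per-depth weights are a design choice made here for illustration; the paper may combine its partial tasks differently.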