Lens: A System for Visual Interpretation of Graphical User Interfaces
Kevin A. Gibbs, T. Winograd, Neil Scott
Abstract: Lens enables natural segmentation of graphical user interfaces. Lens is a system that automatically discovers the structure of user interfaces by finding the hierarchically repeating on-screen elements that structure interfaces. The Lens system was built to facilitate access to modern GUI-based computer systems for blind computer users by extracting useful interface information from the visual representation of the computer screen. This paper documents the exploration and design leading to the Lens system, and describes the algorithms and our current implementation of the system. Our evaluation of the implemented Lens system in real-world situations shows that a system with no prior information about a specific interface implementation can discover useful and usable interface control structure.

INTRODUCTION

In the two decades since their inception, Graphical User Interfaces have gradually become the de facto method of accessing nearly all computer systems. To most users, the thought of doing common computing tasks, such as editing a word processing document or browsing the Internet, from a command line interface now seems as impossible as it does unlikely. Most personal computing tasks today are so dependent upon the common denominator of a GUI, a graphical representation of data and controls, and a visual display, that the thought of trying to get anything done on a computer without them seems roughly equivalent to trying to navigate through an unfamiliar house with your eyes closed and your hands tied behind your back. This is exactly the computer access problem facing the world’s burgeoning population of over 38 million blind people [3].
Although significant efforts have been made in enabling limited applications for the blind (such as text processors like EMACS [1] and screen-reading software for the World-Wide Web [2]), few if any successful efforts have been made in providing access to the generalized GUI systems that the rest of us use on a day-to-day basis. Admittedly, this problem seems quite intractable. How could a blind user ever be expected to make use of a GUI system, which is, at its very core, dependent upon a visual metaphor?

Enter VisualTAS, or the Visual Total Access System. VisualTAS is a system under development by our group at Project Archimedes [4] that attempts to make the GUI accessible to blind users (and other physically disabled users) by breaking the screen image down into a series of objects that can be represented to the blind user by a variety of aural, sonic, and physical interfaces. The VisualTAS has two components: a VisualTAP (VTAP) that captures screen images and extracts a variety of categories of information from these images, and a GUI Accessor that presents each type of information to the blind user in the most appropriate form available [5]. A selection of GUI Accessors will then interact with the user, based on speech, sound, touch, and haptic (force feedback) representations of this information. One example of such a system is The Moose, a haptic interface for feeling visual graphical interface information developed in our lab, which is typical of the type of GUI Accessor that a blind user would interact with [12].

The VisualTAP, or Visual Total Access Port, which is the portion of the VisualTAS responsible for extracting data and understanding control information from the GUI, has its work cut out for it. As the sole source of input available to the VisualTAS, the VTAP must extract enough information from the GUI system to provide a usable system to the blind user.
To follow our earlier metaphor, the VTAP is what “unties” the blind user’s hands, and allows them to navigate through a visual interface by interacting with the extracted information through a variety of other sensory means. The overall task of the VTAP, then, is to extract all useful forms of information possible from the GUI system. This task clearly is composed of many subtasks, some of which are more straightforward than others. For example, text and characters found on the screen should be extracted and output for eventual use by speech-synthesis components. Icons onscreen should be discovered and have visual features extracted from them, so that they can be presented in some non-visual form. Yet these tasks, while quite interesting and significantly difficult, do not broach what is perhaps the single most complex and fundamental problem posed to a system like the VTAP: finding a way to extract and in some way interpret the user interface presented to the user by a GUI. We needed to find some system that could do this in order to make the VisualTAS a reality.

Figure 1. The Moose, a haptic interface [from 12].

This problem of gleaning usable interface information from a GUI system appears to be a fascinating and exciting problem for investigation. However, the programmer attempting to actually implement a suitable system for extracting GUI information for a project like the VisualTAS has many initial hurdles to face.

1. The visual descriptions of GUIs vary widely from one operating system and revision to another. Although some visual metaphors are relatively standard (the concept of a “button,” a “scroll bar,” a “window,” etc.), the actual visual implementations of these metaphors vary widely. Every OS has its own “look and feel,” and this look and feel changes even between mainstream OS revisions, such as Microsoft Windows 2000 and XP, and Apple Macintosh OS 9 and X.
The frightening amount of variety found amongst the bevy of UNIX window managers and interface themes can be left to the reader’s personal nightmares.

2. The function and location of user interface elements within a system vary in a similar manner. For example, consider how many different ways GUI systems deal with the problem of a window close box. Popular flavors of Microsoft Windows, the Apple Macintosh OS, and UNIX all use widely different window controls, which vary not only in visual representation (“look and feel”) but also in spatial location.

3. Interactions in software between programs and the visual GUI are very intricate and highly brittle. Operating systems, which ultimately offer GUI services to programs, are amongst the most complex and rapidly changing pieces of software on computers today. Furthermore, the interactions between programs, the OS, and the visual elements that are finally painted on the screen are ill defined, highly implementation dependent, and generally not meant for outside programs to intercept. Any software that actually runs on the host machine and attempts to intercept information from these interactions will likely be similarly complex and error prone, and will require constant support and revision.

4. Costs of development, of any form, are high. Projects like the VisualTAS are still relatively young and have limited resources and manpower available. Even with available resources, it is quite difficult to find systems programmers of the caliber necessary to solve the highly complex problems that developing a GUI interpretation system like the VTAP requires. Moreover, the developers that are available will likely be spread quite thin amongst the numerous other projects involved in creating a VisualTAS system. Hence, the developer costs of maintaining any developed system cannot be ignored, as a system that requires constant programmatic upkeep may well become derelict shortly.
MOTIVATION: OUR FRUSTRATING EXPERIENCES WITH CURRENT SOLUTIONS

In our search for a viable solution for extracting GUI information in the VTAP, our experiences all pointed back to the problems outlined above. The various systems we found that attempted to solve this general problem, that is, extracting useful control information from a GUI, all suffered from these problems in various forms.

API “shims.” We first investigated systems that attempt to track and interpret all the internal software interactions that ultimately result in the onscreen image. Systems like this typically insert a software “shunt” or wrapper around GUI libraries like Microsoft’s MFC, which is responsible for drawing most of the user interface controls in a Microsoft Windows GUI. However, we found that these systems were, as we predicted above, very intricate and very brittle. Every time any major or minor change to the OS or GUI libraries occurred, the software would stop working altogether, and the system would become useless without a great deal of additional development and fixing. Moreover, even when these systems work, they are (obviously) very OS and revision dependent: an MFC API wrapper is of no use on, say, a MacOS system, which uses a completely different set of APIs and libraries for its GUI, or on a new version of Windows that alters the underlying API the system is trying to wrap. Thus, we did not find these systems to be viable due to their extreme brittleness, system dependence, and high development costs, though their accurate and complete information otherwise seemed promising.

Scripting systems. The next, slightly different approach we investigated was OS-level scripting systems. These systems are usually part of the operating system itself, such as AppleScript, or run as a third-party application, like the classic QuicKeys.
Figure 2. Variety in window controls.

These systems did present a fair number of commands and options for manipulating the GUI, and they also usually have a fair model of what is available on screen for use, such as the available windows and open programs. However, we found that these systems usually suffered from the same brittleness as the “shim” systems (albeit to a lesser extent), in that the available features, options, and implementations were constantly changing between revisions and particular programs. These scripting systems were also very dependent on per-program support, which varied widely and was usua