Abstract: GUI grounding, the task of identifying a precise location on an interface image from a natural language query, plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for one-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework called Iterative Narrowing (IN) to further enhance the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising different UI platforms.

What problem does this paper attempt to address?

This paper aims to improve the performance of Vision-Language Models (VLMs) on GUI grounding. GUI grounding means precisely locating a target element in a user-interface image from a natural-language query. Although general VLMs such as GPT-4V perform well on a variety of vision-language tasks, their GUI grounding performance remains unsatisfactory.

To address this, the author proposes a visual prompting framework named **Iterative Narrowing (IN)**, which improves localization accuracy by iteratively narrowing the prediction area. The method works as follows:

1. **Initial prediction**: The model predicts an initial position from the input image and the natural-language query.
2. **Iterative refinement**: A new crop is generated around the predicted position and used as the input for the next prediction. This step can be repeated several times; each iteration narrows the prediction area further, gradually improving accuracy.
3. **Final prediction**: After the last iteration, the predicted coordinates are converted back into coordinates relative to the original image and returned as the final result.

To evaluate the method, the author ran tests on the ScreenSpot benchmark dataset. The results show that Iterative Narrowing significantly improves the GUI grounding performance of multiple VLMs, especially general VLMs. However, the method has limitations on targets that depend on long-range context, because each crop discards part of the surrounding screen. Future work can focus on how to better preserve both global and local context.
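The three steps above can be sketched in Python as follows. This is a minimal illustration and not the author's implementation: the `predict` callable, its normalized-coordinate return convention, the `shrink` factor, the default of three iterations, and the choice to shift crops so they stay inside the image are all assumptions made for this example. In practice, `predict` would crop the screenshot to the given box and query a VLM.

```python
from typing import Callable, Tuple

# (left, top, right, bottom) in original-image pixels.
Box = Tuple[float, float, float, float]
# Hypothetical model interface: given a crop box and a query, return the
# predicted point as (x, y) normalized to [0, 1] within that crop.
Predictor = Callable[[Box, str], Tuple[float, float]]


def crop_around(cx: float, cy: float, crop_w: float, crop_h: float,
                img_w: float, img_h: float) -> Box:
    """Return a crop_w x crop_h box centred on (cx, cy), shifted so that it
    stays inside the image (an assumed policy for this sketch)."""
    left = min(max(cx - crop_w / 2, 0.0), img_w - crop_w)
    top = min(max(cy - crop_h / 2, 0.0), img_h - crop_h)
    return (left, top, left + crop_w, top + crop_h)


def iterative_narrowing(predict: Predictor, img_w: float, img_h: float,
                        query: str, iterations: int = 3,
                        shrink: float = 0.5) -> Tuple[float, float]:
    """Refine a predicted point by repeatedly cropping around it."""
    box: Box = (0.0, 0.0, float(img_w), float(img_h))  # start with the full image
    x = y = 0.0
    for step in range(iterations):
        # Steps 1 and 2: predict inside the current crop.
        nx, ny = predict(box, query)
        left, top, right, bottom = box
        # Step 3 (applied every round): map back to original-image pixels.
        x = left + nx * (right - left)
        y = top + ny * (bottom - top)
        if step < iterations - 1:
            # Narrow the search area around the latest prediction.
            box = crop_around(x, y, (right - left) * shrink,
                              (bottom - top) * shrink, img_w, img_h)
    return (x, y)


if __name__ == "__main__":
    target = (812.0, 137.0)  # made-up ground-truth location for the demo

    def toy_predict(box: Box, query: str) -> Tuple[float, float]:
        # Stand-in for a VLM with coarse output: rounds to one decimal place.
        left, top, right, bottom = box
        return (round((target[0] - left) / (right - left), 1),
                round((target[1] - top) / (bottom - top), 1))

    print(iterative_narrowing(toy_predict, 1000, 600, "click the search icon"))
```

Because every prediction is mapped back to original-image pixels before the next crop is computed, the final result needs no separate conversion pass. A predictor with fixed relative precision gets a tighter absolute error bound as the crop shrinks, which is the intuition behind the approach; it also shows why a wrong early prediction can crop the target out of view.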
### Summary

The main contribution of this paper is the Iterative Narrowing method, which improves VLM performance on GUI grounding by iteratively narrowing the prediction area.

Improved GUI Grounding via Iterative Narrowing
