Improved GUI Grounding via Iterative Narrowing

Anthony Nguyen
2024-11-18
Abstract: GUI grounding, the task of identifying a precise location on an interface image from a natural language query, plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for one-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework called Iterative Narrowing (IN) to further enhance the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising different UI platforms.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper aims to improve the performance of Vision-Language Models (VLMs) on GUI grounding, the task of precisely locating a target element in a user-interface image from a natural-language query. Although general VLMs such as GPT-4V perform well on a variety of vision-language tasks, their performance on GUI grounding remains unsatisfactory.

To address this, the author proposes a visual prompting framework named **Iterative Narrowing (IN)**. The framework improves the model's localization accuracy by iteratively narrowing the region in which the prediction is made. The method proceeds as follows:

1. **Initial prediction**: The model predicts an initial position from the input image and the natural-language query.
2. **Iterative refinement**: A new crop is generated around the predicted position and used as the input for the next prediction. This step can be repeated multiple times; each iteration narrows the prediction area further and thereby refines the prediction.
3. **Final prediction**: After the last iteration, the predicted coordinates are converted into coordinates relative to the original image and returned as the final result.

To evaluate the method, the author ran tests on the ScreenSpot benchmark. The results show that Iterative Narrowing significantly improves the GUI grounding performance of multiple VLMs, especially general VLMs. However, the method is limited for targets whose identification relies on context far from the target itself, because that context can fall outside the cropped region. Future work can focus on how to better maintain both global and local context.
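The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the author's implementation: the `predict` callable, its normalized-coordinate return format, the `iterations` count, and the `shrink` factor are all hypothetical, since this summary does not give the paper's actual crop sizes or iteration count.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in original-image pixels
Point = Tuple[float, float]              # (x, y)


def crop_around(center: Point, size: Tuple[float, float], bounds: Box) -> Box:
    """Return a box of the given size centered on `center`, shifted to stay inside `bounds`."""
    left, top, right, bottom = bounds
    w = min(size[0], right - left)
    h = min(size[1], bottom - top)
    x0 = min(max(center[0] - w / 2, left), right - w)
    y0 = min(max(center[1] - h / 2, top), bottom - h)
    return (x0, y0, x0 + w, y0 + h)


def iterative_narrowing(
    predict: Callable[[Box], Point],  # hypothetical: wraps "crop the image + query the VLM"
    image_size: Tuple[int, int],
    iterations: int = 3,              # assumed value, not taken from the paper
    shrink: float = 0.5,              # assumed value, not taken from the paper
) -> Point:
    """Refine a predicted location by repeatedly cropping around the last prediction."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    full: Box = (0.0, 0.0, float(image_size[0]), float(image_size[1]))
    box = full
    point: Point = (0.0, 0.0)
    for step in range(iterations):
        # `predict` returns (x, y) normalized to [0, 1] within the current crop.
        nx, ny = predict(box)
        # Convert the crop-relative prediction back to original-image coordinates.
        point = (box[0] + nx * (box[2] - box[0]), box[1] + ny * (box[3] - box[1]))
        if step < iterations - 1:
            # Narrow the search region around the current prediction.
            new_size = ((box[2] - box[0]) * shrink, (box[3] - box[1]) * shrink)
            box = crop_around(point, new_size, full)
    return point
```

In practice, `predict` would crop the screenshot to `box` (for example with an image library's crop function), send the crop and the query to the model, and parse the returned coordinates. Keeping every crop inside the original image bounds means the final point is always a valid position on the full screenshot.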
### Summary

The main contribution of this paper is the Iterative Narrowing method, which improves the performance of VLMs on GUI grounding by iteratively narrowing the prediction area.