LocateBench: Evaluating the Locating Ability of Vision Language Models

Ting-Rui Chiang, Joshua Robinson, Xinyan Velocity Yu, Dani Yogatama
2024-10-17
Abstract: The ability to locate an object in an image according to natural language instructions is crucial for many real-world applications. In this work, we propose LocateBench, a high-quality benchmark dedicated to evaluating this ability. We experiment with multiple prompting approaches and measure the accuracy of several large vision language models. We find that even the accuracy of the strongest model, GPT-4o, lags behind human accuracy by more than 10%.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to evaluate the ability of Vision Language Models (VLMs) to locate objects in images based on natural language instructions. Many studies have assessed VLMs on downstream tasks such as visual question answering and image captioning, but little research measures their localization ability directly. To fill this gap, the authors propose a benchmark dataset called LocateBench. It consists of multiple-choice questions, each of which asks the model to select, from four candidate bounding boxes, the one that best matches a given natural language description. This format lets the authors systematically measure the accuracy of different VLMs on the localization task and compare it with human performance. The experimental results show that even the strongest model tested, GPT-4o, has an accuracy more than 10% lower than that of humans, which indicates that current VLMs still have significant room for improvement in object localization.
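The multiple-choice format described above can be illustrated with a minimal Python sketch. It assumes each question stores a description, four candidate boxes, and the index of the correct box; the names `LocateQuestion` and `accuracy` and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical choices made for illustration and are not taken from the LocateBench release or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumed box layout for illustration only: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class LocateQuestion:
    """One hypothetical multiple-choice item: a description plus candidate boxes."""
    description: str
    candidates: List[Box]   # four candidate bounding boxes in the described setup
    answer_index: int       # index of the box that matches the description


def accuracy(questions: Sequence[LocateQuestion], predictions: Sequence[int]) -> float:
    """Fraction of questions where the predicted box index equals the answer index."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        return 0.0
    correct = sum(1 for q, p in zip(questions, predictions) if p == q.answer_index)
    return correct / len(questions)


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (10, 0, 20, 10), (0, 10, 10, 20), (10, 10, 20, 20)]
    questions = [
        LocateQuestion("the cup on the left", boxes, answer_index=0),
        LocateQuestion("the cup nearest the window", boxes, answer_index=2),
    ]
    # Toy predictions: the first is right, the second is wrong.
    print(accuracy(questions, [0, 1]))
```

Under this setup, the same `accuracy` function could be applied to a model's answers and to human answers, so the two numbers are directly comparable.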