Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin
2024-04-11
Abstract: Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g., BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the weak spatial reasoning and localization awareness of existing visual-LLMs (V-LLMs). Although these models generate highly descriptive and detailed textual answers, they fail at simple tasks such as deciding whether an object is on the left or the right of an image. The paper explores instruction fine-tuning objectives based on image-space coordinates as a way to inject spatial awareness into V-LLMs, with the further aims of improving visual question answering (VQA), reducing undesired hallucination, and generating better contextual object descriptions. Specifically, it proposes three new position-based instruction fine-tuning objectives, looks for the coordinate representation that works best, generates pseudo-data, and extends the approach to the video domain. The resulting model, called LocVLM, is intended to overcome the limitations of existing V-LLMs in spatial localization and understanding while retaining their original strengths, and the paper reports performance improvements across multiple vision-language tasks in the image and video domains.
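To make the idea of coordinate-based instruction fine-tuning concrete, the following is a minimal sketch of how a bounding box might be turned into a textual instruction-response pair. It is an illustration only, not the authors' implementation: the `Box` class, the prompt wording, the normalized two-decimal coordinate format, and the helper names `normalize_box`, `location_sample`, and `left_or_right` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def normalize_box(box: Box, width: int, height: int, ndigits: int = 2) -> tuple:
    """Scale pixel coordinates to [0, 1] and round to a fixed number of decimals.

    Normalized decimals are one possible textual coordinate format; it is
    assumed here and is not claimed to be the one the paper selects.
    """
    return (
        round(box.x1 / width, ndigits),
        round(box.y1 / height, ndigits),
        round(box.x2 / width, ndigits),
        round(box.y2 / height, ndigits),
    )


def location_sample(box: Box, width: int, height: int) -> dict:
    """Build one instruction-response pair asking where an object is.

    The prompt template is invented for illustration.
    """
    coords = normalize_box(box, width, height)
    coord_text = ", ".join(f"{c:.2f}" for c in coords)
    return {
        "instruction": f"Where is the {box.label} located in the image?",
        "response": f"The {box.label} is at [{coord_text}].",
    }


def left_or_right(box: Box, width: int) -> str:
    """Label an object 'left' or 'right' by comparing its center to the image midline."""
    center_x = (box.x1 + box.x2) / 2
    return "left" if center_x < width / 2 else "right"


if __name__ == "__main__":
    dog = Box("dog", 64, 120, 320, 480)
    print(location_sample(dog, width=640, height=480))
    print(left_or_right(dog, width=640))
```

Pairs built this way could be mixed into an instruction fine-tuning set so that the language model sees object locations expressed as plain text. The left/right helper shows how a simple spatial question of the kind the abstract mentions could be derived from the same box annotations.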