Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

2024-04-11

Abstract: Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the weak spatial reasoning and localization awareness of existing visual-LLMs (V-LLMs). Although these models generate highly descriptive and detailed textual answers, they fail at simple tasks such as telling whether an object is on the left or on the right. The paper explores instruction fine-tuning objectives based on image-space coordinates as a way to inject spatial awareness into V-LLMs, with the goal of improving visual question answering (VQA), reducing undesired hallucination, and generating better contextual object descriptions. Specifically, it proposes three new location-based instruction fine-tuning objectives, together with work on the coordinate representation, pseudo-data generation, and an extension to the video domain. These methods aim to improve the model's spatial reasoning, and the paper reports clear performance improvements across multiple vision-language tasks in the image and video domains. The authors intend the resulting model, called LocVLM, to overcome the limitations of existing V-LLMs in spatial localization and understanding while retaining their original strengths.
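To make the idea of coordinate-based instruction fine-tuning more concrete, the sketch below shows one way a bounding box could be turned into text-form training samples. It is an illustration under stated assumptions, not the paper's actual pipeline: the `Box` type, the function names, the prompt wording, and the choice of normalized two-decimal coordinates are all hypothetical, since this summary does not specify the coordinate representation or the templates the authors settled on.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper type)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def coords_as_text(box: Box, width: int, height: int, decimals: int = 2) -> str:
    """Scale pixel coordinates to [0, 1] and render them as a text string.

    Normalized floating-point text is an assumed representation chosen for
    illustration; the source does not state which representation is optimal.
    """
    values = (
        box.x_min / width,
        box.y_min / height,
        box.x_max / width,
        box.y_max / height,
    )
    return "[" + ", ".join(f"{v:.{decimals}f}" for v in values) + "]"


def location_sample(box: Box, width: int, height: int) -> dict:
    """Build one instruction/response pair asking where an object is.

    The prompt template is made up for this example.
    """
    return {
        "instruction": f"Where is the {box.label} located in the image?",
        "response": f"The {box.label} is at {coords_as_text(box, width, height)}.",
    }


def left_right_sample(a: Box, b: Box) -> dict:
    """Build one pair about relative position by comparing box centers on the x-axis."""
    center_a = (a.x_min + a.x_max) / 2
    center_b = (b.x_min + b.x_max) / 2
    side = "left" if center_a < center_b else "right"
    return {
        "instruction": f"Is the {a.label} to the left or right of the {b.label}?",
        "response": side,
    }


if __name__ == "__main__":
    dog = Box("dog", 64, 120, 320, 360)
    cat = Box("cat", 400, 100, 600, 300)
    print(location_sample(dog, width=640, height=480))
    print(left_right_sample(dog, cat))
```

Samples of this kind could be produced from existing box annotations and mixed into an instruction fine-tuning set; how the paper actually generates and weights such data is not described in this summary.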
