SpatialBot: Precise Spatial Understanding with Vision Language Models

Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, Bo Zhao

2024-08-01

Abstract: Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding; however, they still struggle with spatial understanding, which is the foundation of Embodied AI. In this paper, we propose SpatialBot for better spatial understanding by feeding both RGB and depth images. Additionally, we have constructed the SpatialQA dataset, which involves multi-level depth-related questions to train VLMs for depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' capabilities in spatial understanding at different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks, and Embodied AI tasks demonstrate the remarkable improvements of SpatialBot trained on SpatialQA. The model, code, and data are available at https://github.com/BAAI-DCAI/SpatialBot.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the weakness of Vision Language Models (VLMs) in spatial understanding. Although VLMs have made remarkable progress in 2D image understanding, they still struggle with spatial understanding tasks, especially those that require depth information. The paper identifies three specific problems:

1. **Limitations in understanding depth information**: Existing VLMs are mainly trained on RGB images and lack the ability to understand depth maps, so they perform poorly when depth maps are given directly as input.
2. **Lack of appropriate training datasets**: There are currently no datasets specifically designed for training VLMs to understand depth. Common VLM tuning datasets do not contain corresponding depth maps, and depth-related tasks lack data in question-and-answer format.
3. **Inconsistent indoor and outdoor depth scales**: The range of depth values and the precision requirements differ between indoor and outdoor scenes. For example, indoor navigation and manipulation tasks require millimeter-level precision, while outdoor tasks require a wider depth range.

To address these problems, the paper proposes the following:

- **SpatialBot**: A model that takes both RGB and depth maps as input to enhance the spatial understanding ability of VLMs.
- **SpatialQA dataset**: A dataset of multi-level depth-related questions for training VLMs to understand depth.
- **SpatialBench benchmark**: A comprehensive benchmark that evaluates the spatial understanding ability of VLMs at different levels.

Through these methods, the paper aims to improve the performance of VLMs on spatial understanding tasks and to verify the improvements on general VLM benchmarks and robotic manipulation tasks.
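The tension between indoor and outdoor depth scales (problem 3 above) can be illustrated with a little arithmetic. The following is a minimal sketch, not the encoding used by SpatialBot: it assumes depth is stored as an unsigned integer count of millimeters, and the function names `max_range_m` and `encode_depth_mm` are hypothetical. It shows how the bit width of such an encoding caps the representable range once millimeter resolution is fixed.

```python
# Illustrative sketch only. The millimeter unit, the bit widths, and both
# function names are assumptions made for this example, not the paper's method.

def max_range_m(bits: int, unit_mm: float = 1.0) -> float:
    """Largest depth in meters representable with `bits` unsigned bits
    when one integer step equals `unit_mm` millimeters."""
    return (2 ** bits - 1) * unit_mm / 1000.0


def encode_depth_mm(depth_m: float, bits: int) -> int:
    """Quantize a metric depth to whole millimeters and clip it to the
    range of an unsigned integer with `bits` bits."""
    max_value = 2 ** bits - 1
    millimeters = round(depth_m * 1000.0)
    return min(max(millimeters, 0), max_value)


if __name__ == "__main__":
    # Compare how far each bit width reaches at 1 mm resolution.
    for bits in (8, 16, 24):
        print(f"{bits}-bit, 1 mm steps: up to {max_range_m(bits):.3f} m")

    # A close-range (indoor-style) reading keeps millimeter detail.
    print(encode_depth_mm(0.4237, 16))
    # A far (outdoor-style) reading saturates a 16-bit millimeter encoding.
    print(encode_depth_mm(120.0, 16))
```

Under these assumptions, a 16-bit millimeter encoding tops out at about 65.5 m, which is ample for indoor manipulation but too short for many outdoor scenes. A wider integer type or a coarser unit is needed to cover both, which is the trade-off the summary describes.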

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

Structured Spatial Reasoning with Open Vocabulary Object Detectors

SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors

Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation

Things not Written in Text: Exploring Spatial Commonsense from Visual Signals

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

What's "up" with vision-language models? Investigating their struggle with spatial reasoning

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

SPA: 3D Spatial-Awareness Enables Effective Embodied Representation

Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning

Weakly-Supervised 3D Spatial Reasoning for Text-based Visual Question Answering

Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI

LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description