SpatialBot: Precise Spatial Understanding with Vision Language Models

Wenxiao Cai, Yaroslav Ponomarenko, Jianhao Yuan, Xiaoqi Li, Wankou Yang, Hao Dong, Bo Zhao
2024-08-01
Abstract: Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding; however, they still struggle with spatial understanding, which is the foundation of Embodied AI. In this paper, we propose SpatialBot for better spatial understanding by feeding both RGB and depth images. Additionally, we have constructed the SpatialQA dataset, which involves multi-level depth-related questions to train VLMs for depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' capabilities in spatial understanding at different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks, and Embodied AI tasks demonstrate the remarkable improvements of SpatialBot trained on SpatialQA. The model, code, and data are available at https://github.com/BAAI-DCAI/SpatialBot.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the weak spatial understanding of Vision Language Models (VLMs). Although VLMs have made remarkable progress in 2D image understanding, they still struggle with spatial understanding tasks, particularly those that depend on depth information. The problem shows up in three ways:

1. **Limitations in understanding depth information**: Existing VLMs are mainly trained on RGB images and lack the ability to understand depth maps, so they perform poorly when depth maps are given directly as input.
2. **Lack of appropriate training datasets**: There are currently no datasets specifically designed for training VLMs to understand depth. Common VLM tuning datasets do not contain corresponding depth maps, and depth-related tasks lack data in question-and-answer format.
3. **Inconsistent indoor and outdoor depth scales**: The range of depth values and the precision requirements differ between indoor and outdoor scenes. For example, indoor navigation and manipulation tasks require millimeter-level precision, while outdoor tasks require a wider depth range.

To solve these problems, the paper proposes the following:

- **SpatialBot**: A VLM that takes both RGB and depth images as input to enhance spatial understanding.
- **SpatialQA dataset**: A dataset of multi-level depth-related questions for training VLMs to understand depth.
- **SpatialBench benchmark**: A comprehensive benchmark that evaluates the spatial understanding ability of VLMs at different levels.

Through these methods, the paper aims to improve the performance of VLMs on spatial understanding tasks and to verify the improvement on general VLM benchmarks and robotic manipulation tasks.
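
The tension between indoor precision and outdoor range in the third point can be made concrete with a little arithmetic. The sketch below is not the depth encoding used by SpatialBot, which this summary does not describe. It assumes a hypothetical depth map stored as fixed-step unsigned integers, and the helper names `max_range_m` and `quantize` are illustrative only.

```python
# Illustrative sketch only: the bit widths and step sizes are assumptions,
# not the encoding used by SpatialBot.

def max_range_m(bits: int, unit_m: float) -> float:
    """Largest depth in meters that an unsigned integer of `bits` bits
    can represent when one integer step equals `unit_m` meters."""
    return (2 ** bits - 1) * unit_m


def quantize(depth_m: float, bits: int, unit_m: float) -> int:
    """Round a metric depth to the nearest integer step and clip it
    to the representable range [0, 2**bits - 1]."""
    steps = round(depth_m / unit_m)
    return max(0, min(steps, 2 ** bits - 1))


if __name__ == "__main__":
    # Millimeter steps in 16 bits: fine resolution, but limited range.
    print(f"16-bit, 1 mm step: up to {max_range_m(16, 0.001):.3f} m")
    # Centimeter steps in 16 bits: ten times the range, ten times coarser.
    print(f"16-bit, 1 cm step: up to {max_range_m(16, 0.01):.2f} m")

    # A distant outdoor point saturates the millimeter encoding.
    far = quantize(120.0, 16, 0.001)
    print(f"120 m stored with 1 mm steps clips to {far} steps")
```

With a fixed bit budget, a smaller step buys precision at the cost of range, which is why a single depth representation that serves both indoor manipulation and outdoor scenes needs either more bits or a different scaling.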