Abstract: Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial reasoning capabilities are under-investigated. In this paper, we construct a novel VQA dataset, Spatial-MM, to comprehensively study LMMs' spatial understanding and reasoning capabilities. Our analyses on object-relationship and multi-hop reasoning reveal several important findings. Firstly, bounding boxes and scene graphs, even synthetic ones, can significantly enhance LMMs' spatial reasoning. Secondly, LMMs struggle more with questions posed from the human perspective than the camera perspective about the image. Thirdly, chain of thought (CoT) prompting does not improve model performance on complex multi-hop questions involving spatial relations. Lastly, our perturbation analysis on GQA-spatial reveals that LMMs are much stronger at basic object detection than complex spatial reasoning. We believe our benchmark dataset and in-depth analyses can spark further research on LMMs' spatial reasoning. The Spatial-MM benchmark is available at: https://github.com/FatemehShiri/Spatial-MM

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to investigate the capabilities of large multimodal models (LMMs) in spatial understanding and reasoning. Although LMMs have shown excellent performance in various visual and language tasks, their spatial reasoning abilities have not been fully explored. To comprehensively evaluate the spatial reasoning capabilities of LMMs, the authors constructed a new visual question answering (VQA) dataset, Spatial-MM, and conducted an in-depth analysis of LMMs using this dataset.

### Specific Questions

1. **In which spatial relationships do LMMs fall short? Why do these issues occur?**
   - The authors explored the performance of LMMs in handling different spatial relationships by constructing a dataset that includes various spatial relationships.
2. **How does additional symbolic visual information (such as bounding boxes or scene graphs) improve the spatial reasoning performance of LMMs? Which type of symbolic information is more useful, and how can it be effectively integrated into the reasoning process?**
   - The authors experimentally verified the impact of bounding boxes and scene graphs on the spatial reasoning capabilities of LMMs and analyzed the effectiveness of this information.
3. **How does the complexity of the questions affect the ability of LMMs to handle spatial relationships?**
   - The authors studied the impact of question complexity on the spatial reasoning capabilities of LMMs by designing multi-hop reasoning questions of varying complexity.
4. **When LMMs fail to answer multi-hop questions, how do their reasoning paths perform? Are the failures due to spatial reasoning errors or non-spatial reasoning errors?**
   - The authors analyzed the performance of LMMs in multi-hop reasoning by generating and verifying reasoning paths and explored the reasons for failures.

### Main Contributions

1. **Proposed a new, challenging spatial awareness benchmark dataset Spatial-MM**, covering various types of spatial relationships, including questions posed from human and camera perspectives.
2. **Comprehensive empirical analysis revealed the following important findings**:
   - Bounding boxes and scene graphs, even synthetic ones, can significantly enhance the spatial reasoning capabilities of LMMs.
   - LMMs find it more challenging to handle questions posed from a human perspective than from a camera perspective.
   - Chain-of-Thought (CoT) prompting does not improve model performance on complex multi-hop questions involving spatial relationships.
   - LMMs perform better in basic object detection but are weaker in complex spatial reasoning.

Through these studies, the authors hope to inspire more research directions on the spatial reasoning capabilities of LMMs.
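
To make the idea of "symbolic visual information" concrete, the following is a minimal sketch of how bounding boxes and a simple scene graph might be serialized into a text prompt alongside a spatial question. It is not the authors' pipeline: the `Box` class, the functions `horizontal_relation` and `build_prompt`, and the prompt layout are all hypothetical. The sketch also assumes that the "human" perspective means a person in the image facing the camera, so left and right are mirrored relative to the camera view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box in image pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center_x(self) -> float:
        return (self.x_min + self.x_max) / 2


def horizontal_relation(a: Box, b: Box, perspective: str = "camera") -> str:
    """Return 'left of' or 'right of' for box a relative to box b.

    Assumption: the 'human' perspective is that of a person in the image
    facing the camera, so left/right are swapped versus the camera view.
    Boxes with equal centers are reported as 'right of' in the camera view.
    """
    if perspective not in ("camera", "human"):
        raise ValueError(f"unknown perspective: {perspective!r}")
    a_is_left = a.center_x < b.center_x
    if perspective == "human":
        a_is_left = not a_is_left
    return "left of" if a_is_left else "right of"


def build_prompt(question: str, boxes: list[Box], perspective: str = "camera") -> str:
    """Serialize boxes and pairwise left/right relations into a text prompt."""
    lines = ["Objects (label: x_min, y_min, x_max, y_max):"]
    for box in boxes:
        lines.append(
            f"- {box.label}: {box.x_min:g}, {box.y_min:g}, {box.x_max:g}, {box.y_max:g}"
        )
    lines.append(f"Scene graph ({perspective} perspective):")
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            lines.append(f"- {a.label} is {horizontal_relation(a, b, perspective)} {b.label}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative coordinates only; not taken from Spatial-MM.
    dog = Box("dog", 10, 40, 60, 90)
    ball = Box("ball", 120, 70, 150, 100)
    print(build_prompt("Where is the dog relative to the ball?", [dog, ball]))
```

The resulting text would be sent to a model together with the image. The sketch covers only horizontal relations; a fuller scene graph would also need vertical, depth, and containment relations.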

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

SpatialBot: Precise Spatial Understanding with Vision Language Models

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

Exploring and Improving the Spatial Reasoning Abilities of Large Language Models

The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model

Inherent limitations of LLMs regarding spatial information

SAT: Spatial Aptitude Training for Multimodal Language Models