An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, Yuan-Fang Li
2024-11-09
Abstract: Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial reasoning capabilities are under-investigated. In this paper, we construct a novel VQA dataset, Spatial-MM, to comprehensively study LMMs' spatial understanding and reasoning capabilities. Our analyses on object-relationship and multi-hop reasoning reveal several important findings. Firstly, bounding boxes and scene graphs, even synthetic ones, can significantly enhance LMMs' spatial reasoning. Secondly, LMMs struggle more with questions posed from the human perspective than the camera perspective about the image. Thirdly, chain of thought (CoT) prompting does not improve model performance on complex multi-hop questions involving spatial relations. Lastly, our perturbation analysis on GQA-spatial reveals that LMMs are much stronger at basic object detection than complex spatial reasoning. We believe our benchmark dataset and in-depth analyses can spark further research on LMMs' spatial reasoning. The Spatial-MM benchmark is available at: https://github.com/FatemehShiri/Spatial-MM
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper investigates the capabilities of large multimodal models (LMMs) in spatial understanding and reasoning. Although LMMs perform well on a range of vision and language tasks, their spatial reasoning abilities have not been fully explored. To evaluate these abilities comprehensively, the authors constructed a new visual question answering (VQA) dataset, Spatial-MM, and used it to analyze LMMs in depth.

### Specific Questions

1. **In which spatial relationships do LMMs fall short? Why do these issues occur?**
   - The authors examined how LMMs handle different spatial relationships by constructing a dataset that covers a variety of them.
2. **How does additional symbolic visual information (such as bounding boxes or scene graphs) improve the spatial reasoning performance of LMMs? Which type of symbolic information is more useful, and how can it be effectively integrated into the reasoning process?**
   - The authors experimentally measured the impact of bounding boxes and scene graphs on the spatial reasoning of LMMs and analyzed how effective each kind of information is.
3. **How does the complexity of the questions affect the ability of LMMs to handle spatial relationships?**
   - The authors studied the effect of question complexity by designing multi-hop reasoning questions of varying complexity.
4. **When LMMs fail to answer multi-hop questions, what do their reasoning paths look like? Are the failures due to spatial reasoning errors or non-spatial reasoning errors?**
   - The authors generated and verified reasoning paths for multi-hop questions and used them to explore the reasons for failures.

### Main Contributions

1. **A new, challenging spatial awareness benchmark dataset, Spatial-MM**, covering various types of spatial relationships, including questions posed from human and camera perspectives.
2. **A comprehensive empirical analysis with the following findings**:
   - Bounding boxes and scene graphs, even synthetic ones, can significantly enhance the spatial reasoning capabilities of LMMs.
   - LMMs find questions posed from the human perspective more challenging than questions posed from the camera perspective.
   - Chain-of-Thought (CoT) prompting does not improve model performance on complex multi-hop questions involving spatial relationships.
   - LMMs are much stronger at basic object detection than at complex spatial reasoning.

Through these studies, the authors hope to inspire further research on the spatial reasoning capabilities of LMMs.
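
The question of how bounding boxes and scene graphs might be integrated into the reasoning process can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Box` type, the `scene_graph` and `build_prompt` helpers, and the prompt layout are all hypothetical, and the only relations derived are camera-perspective "left of" and "right of", computed from box centers under the usual image convention that x grows to the right.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box in pixel coordinates (x grows rightward)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center_x(self) -> float:
        return (self.x_min + self.x_max) / 2


def horizontal_relation(a: Box, b: Box) -> str:
    """Camera-perspective relation of a to b, based on box centers only."""
    if a.center_x < b.center_x:
        return "left of"
    if a.center_x > b.center_x:
        return "right of"
    return "aligned with"


def scene_graph(boxes: list[Box]) -> list[tuple[str, str, str]]:
    """Build (subject, relation, object) triples for every ordered pair."""
    return [
        (a.label, horizontal_relation(a, b), b.label)
        for a in boxes
        for b in boxes
        if a is not b
    ]


def build_prompt(question: str, boxes: list[Box]) -> str:
    """Prepend boxes and derived triples to the question (assumed layout)."""
    box_lines = [
        f"- {b.label}: [{b.x_min}, {b.y_min}, {b.x_max}, {b.y_max}]"
        for b in boxes
    ]
    graph_lines = [f"- {s} is {r} {o}" for s, r, o in scene_graph(boxes)]
    return "\n".join(
        ["Bounding boxes (x_min, y_min, x_max, y_max):", *box_lines,
         "Scene graph:", *graph_lines,
         f"Question: {question}"]
    )


if __name__ == "__main__":
    boxes = [Box("cup", 10, 40, 30, 60), Box("laptop", 50, 20, 90, 70)]
    print(build_prompt("Where is the cup relative to the laptop?", boxes))
```

The resulting text would accompany the image in the model input. Relations from a human's perspective inside the image would need extra information, such as which way the person is facing, which this sketch does not model.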