Abstract: Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the Mind's Eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit spatial reasoning of LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal large language models (MLLMs) in these tasks. While VoT works surprisingly well on LLMs, the ability to generate mental images to facilitate spatial reasoning resembles the mind's eye process, suggesting its potential viability in MLLMs. Please find the dataset and code at https://microsoft.github.io/visualization-of-thought
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper aims to address the shortcomings of large language models (LLMs) in spatial reasoning capabilities. Although LLMs excel in language understanding and various reasoning tasks, their ability in spatial reasoning has not been fully explored. Humans have the ability to create mental images of unseen objects and actions through the "Mind’s Eye," which allows them to imagine the unseen world. Inspired by this cognitive ability, the authors propose the "Visualization-of-Thought" (VoT) prompting method. The goal of VoT is to enhance the spatial reasoning capabilities of LLMs by visualizing the reasoning process to guide subsequent reasoning steps.
### Specific Research Background and Objectives
1. **Importance of Spatial Reasoning**:
- Spatial reasoning is a crucial part of human cognition, enabling us to interact with the environment and understand the spatial relationships and movements of objects.
- In fields such as navigation, robotics, and autonomous driving, spatial reasoning is fundamental for planning actions.
2. **Limitations of Existing Research**:
- Although many tasks and datasets explore spatial semantics in text, existing research often focuses on the linguistic structure of spatial terms.
- Even significant achievements in these benchmarks do not necessarily mean that LLMs truly understand spatial information or can accurately measure their spatial awareness.
3. **Proposed Method**:
- **VoT Prompting Method**: VoT adds visual state tracking at each intermediate reasoning step, so that the model generates both reasoning traces and visualizations of the current state, with the aim of enhancing the spatial reasoning capabilities of LLMs.
- **Experimental Tasks**: Three tasks requiring spatial awareness were selected: natural language navigation, visual navigation, and visual tiling. Together they require spatial, directional, and geometric reasoning.
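The idea of visual state tracking at each intermediate reasoning step can be illustrated with a small sketch. The code below is not the authors' implementation: the function names, the text-grid rendering, and the exact instruction wording are all assumptions made for illustration. It shows how a prompt might ask for a visualization after every step, and what such a sequence of per-step visualizations could look like for a simple 2D grid navigation.

```python
# Hypothetical sketch of VoT-style state tracking in a 2D grid world.
# All names and the instruction wording are illustrative assumptions,
# not taken from the paper's code.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

# Assumed instruction wording; the paper's exact prompt may differ.
VOT_INSTRUCTION = "Visualize the state after each reasoning step."


def build_vot_prompt(task_description: str) -> str:
    """Append the (assumed) VoT instruction to a task description."""
    return f"{task_description}\n{VOT_INSTRUCTION}"


def render(rows: int, cols: int, pos: tuple) -> str:
    """Draw the grid as text: 'A' marks the agent, '.' marks an empty cell."""
    return "\n".join(
        "".join("A" if (r, c) == pos else "." for c in range(cols))
        for r in range(rows)
    )


def trace(rows: int, cols: int, start: tuple, moves: list) -> list:
    """Return one rendered grid for the start state and one per move.

    This mimics the kind of step-by-step visualizations that a VoT prompt
    asks a model to produce alongside its reasoning.
    """
    pos = start
    frames = [render(rows, cols, pos)]
    for move in moves:
        dr, dc = MOVES[move]
        r, c = pos[0] + dr, pos[1] + dc
        if not (0 <= r < rows and 0 <= c < cols):
            raise ValueError(f"move {move!r} leaves the grid")
        pos = (r, c)
        frames.append(render(rows, cols, pos))
    return frames


if __name__ == "__main__":
    print(build_vot_prompt("Start at the top-left of a 3x3 grid. Move right, then down."))
    for step, frame in enumerate(trace(3, 3, (0, 0), ["right", "down"])):
        print(f"\nState after step {step}:\n{frame}")
```

In an actual VoT setting the model itself would emit the grids as part of its output; here they are computed deterministically only to make the notion of a visualized reasoning trace concrete.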
### Main Contributions
1. **Exploring the "Mind’s Eye" of LLMs from a Cognitive Perspective**:
- Conducted quantitative and qualitative analyses of the "Mind’s Eye" of LLMs, exploring its limitations and the impact of code pre-training on its capabilities.
2. **Developing New Tasks and Datasets**:
- Designed two tasks, "visual navigation" and "visual tiling," and generated corresponding synthetic datasets to simulate various sensory inputs for LLMs.
3. **Proposing and Evaluating the VoT Prompting Method**:
- Experimental results show that the VoT prompting method significantly improves the spatial reasoning capabilities of LLMs across multiple tasks, outperforming other prompting methods and existing multimodal LLMs.
### Conclusion
By introducing the VoT prompting method, the authors enhanced the spatial reasoning capabilities of LLMs on the evaluated tasks, enabling them to better track and process spatial information. The abstract further notes that generating mental images to facilitate spatial reasoning resembles the mind's eye process, which suggests the approach may also be viable in MLLMs.