Abstract: Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks. However, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the Mind's Eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, we propose Visualization-of-Thought (VoT) prompting. VoT aims to elicit spatial reasoning of LLMs by visualizing their reasoning traces, thereby guiding subsequent reasoning steps. We employed VoT for multi-hop spatial reasoning tasks, including natural language navigation, visual navigation, and visual tiling in 2D grid worlds. Experimental results demonstrated that VoT significantly enhances the spatial reasoning abilities of LLMs. Notably, VoT outperformed existing multimodal large language models (MLLMs) in these tasks. While VoT works surprisingly well on LLMs, the ability to generate mental images to facilitate spatial reasoning resembles the mind's eye process, suggesting its potential viability in MLLMs. The dataset and code are available at https://microsoft.github.io/visualization-of-thought

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the shortcomings of large language models (LLMs) in spatial reasoning. Although LLMs excel in language understanding and various reasoning tasks, their spatial reasoning ability has not been fully explored. Humans can create mental images of unseen objects and actions through the "Mind’s Eye," which allows them to imagine the unseen world. Inspired by this cognitive ability, the authors propose the "Visualization-of-Thought" (VoT) prompting method. The goal of VoT is to enhance the spatial reasoning capabilities of LLMs by visualizing the reasoning process so that each visualization guides the subsequent reasoning steps.

### Specific Research Background and Objectives

1. **Importance of Spatial Reasoning**:
   - Spatial reasoning is a crucial part of human cognition, enabling us to interact with the environment and understand the spatial relationships and movements of objects.
   - In fields such as navigation, robotics, and autonomous driving, spatial reasoning is fundamental for planning actions.
2. **Limitations of Existing Research**:
   - Although many tasks and datasets explore spatial semantics in text, existing research often focuses on the linguistic structure of spatial terms.
   - Strong results on these benchmarks do not necessarily mean that LLMs truly understand spatial information, nor do the benchmarks accurately measure their spatial awareness.
3. **Proposed Method**:
   - **VoT Prompting Method**: Visual state tracking is added at each intermediate reasoning step, so the model generates a reasoning trajectory interleaved with visualizations of the current state.
   - **Experimental Tasks**: Three tasks requiring spatial awareness were selected: natural language navigation, visual navigation, and visual tiling. Together they require spatial, directional, and geometric reasoning.

### Main Contributions

1. **Exploring the "Mind’s Eye" of LLMs from a Cognitive Perspective**:
   - Conducted quantitative and qualitative analyses of the "Mind’s Eye" of LLMs, exploring its limitations and the impact of code pre-training on its capabilities.
2. **Developing New Tasks and Datasets**:
   - Designed two tasks, "visual navigation" and "visual tiling," and generated corresponding synthetic datasets to simulate various sensory inputs for LLMs.
3. **Proposing and Evaluating the VoT Prompting Method**:
   - Experimental results show that the VoT prompting method significantly improves the spatial reasoning capabilities of LLMs across multiple tasks, outperforming other prompting methods and existing multimodal LLMs.

### Conclusion

By introducing the VoT prompting method, the authors enhanced the spatial reasoning capabilities of LLMs on the evaluated tasks, enabling them to better track and process spatial information. The method is of theoretical interest and also shows potential value in practical applications.
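
### Illustrative Sketch

The visual state tracking described above (a text rendering of the grid after every intermediate step) can be illustrated with a small Python sketch. This is not the authors' code: the grid symbols, the move vocabulary, the function names `render`, `trace_with_visualization`, and `build_prompt`, and the exact wording of the instruction string are all assumptions made for illustration. The sketch produces the interleaved "step, then picture of the state" trace programmatically, to show the format a VoT prompt is meant to elicit from a model.

```python
from typing import List, Set, Tuple

# Hypothetical move vocabulary for a 2D grid world: (row delta, column delta).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def render(rows: int, cols: int, pos: Tuple[int, int],
           visited: Set[Tuple[int, int]]) -> str:
    """Draw the grid as text: '@' = current cell, '*' = visited, '.' = unvisited."""
    lines = []
    for r in range(rows):
        line = ""
        for c in range(cols):
            if (r, c) == pos:
                line += "@"
            elif (r, c) in visited:
                line += "*"
            else:
                line += "."
        lines.append(line)
    return "\n".join(lines)


def trace_with_visualization(rows: int, cols: int, start: Tuple[int, int],
                             moves: List[str]) -> List[str]:
    """Interleave each step (a move) with a text rendering of the resulting state."""
    pos = start
    visited = {start}
    trace = ["Start:\n" + render(rows, cols, pos, visited)]
    for i, move in enumerate(moves, 1):
        dr, dc = MOVES[move]
        nxt = (pos[0] + dr, pos[1] + dc)
        if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
            raise ValueError(f"step {i}: move {move!r} leaves the grid")
        pos = nxt
        visited.add(pos)
        trace.append(f"Step {i}: move {move}\n" + render(rows, cols, pos, visited))
    return trace


def build_prompt(task: str) -> str:
    """Append a VoT-style instruction to a task description (wording is an assumption)."""
    return task.rstrip() + "\nVisualize the state after each reasoning step."


if __name__ == "__main__":
    print(build_prompt("Navigate from the top-left cell of a 3x3 grid to the bottom-middle cell."))
    print()
    print("\n\n".join(trace_with_visualization(3, 3, (0, 0), ["right", "down", "down"])))
```

Rendering the state after every move, rather than only at the end, mirrors the idea that each visualization conditions the next reasoning step; a model following the prompt would be expected to produce a similar interleaved trace on its own.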

Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models

Enhancing Advanced Visual Reasoning Ability of Large Language Models

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

Large Language Models are Visual Reasoning Coordinators

Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

Enhance Reasoning Ability of Visual-Language Models via Large Language Models

How Far Are We from Intelligent Visual Deductive Reasoning?

Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models

Language-Image Models with 3D Understanding