Abstract: As large-scale models evolve, language instructions are increasingly utilized in multi-modal tasks. Due to human language habits, these instructions often contain ambiguities in real-world scenarios, necessitating the integration of visual context or common sense for accurate interpretation. However, even highly intelligent large models exhibit significant performance limitations on ambiguous instructions, where weak disambiguation reasoning can lead to catastrophic errors. To address this issue, this paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework. It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models or empirical experience for generally intelligent models to understand ambiguous instructions. Unlike traditional methods that require models to possess high intelligence to understand long texts or perform lengthy complex reasoning, our framework does not significantly increase computational overhead and is more general and effective, even for generally intelligent models. Experiments show that our method not only significantly enhances the performance of models of different intelligence levels on ambiguous instructions but also improves their performance on general datasets. Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity. We will release our data and code.

What problem does this paper attempt to address?

The paper addresses the significant performance limitations that existing large models show when understanding and executing ambiguous instructions in multimodal tasks, a consequence of the inherent ambiguity of natural language. In real-world scenarios, a model often has to combine visual context or common sense to interpret such an instruction correctly, but current models disambiguate poorly, which leads to serious errors. To solve this problem, the paper proposes a multimodal multi-turn chain-of-thought reasoning framework named **VISUAL-O1**, which simulates the human multimodal multi-turn reasoning process to help models understand ambiguous instructions and generate correct answers.

### Main Contributions:

1. **Revealing the capabilities of multimodal models**: A new benchmark evaluates how well multimodal models analyze and execute ambiguous instructions across various multimodal tasks.
2. **Proposing the VISUAL-O1 framework**: A multimodal multi-turn chain-of-thought reasoning method that constructs instantial experience for high-intelligence models and empirical experience for general-intelligence models, helping both understand ambiguous instructions correctly.
3. **Experimental validation**: Experimental results show that VISUAL-O1 significantly improves the performance of models with different intelligence levels on ambiguous instruction datasets and also performs well on general datasets.

### Method Overview:

- **Multimodal multi-turn chain-of-thought reasoning**: Through multiple rounds of reasoning and reflection, the reasoning results are gradually refined until the correct answer can be generated.
- **Instantial experience and empirical experience**: For high-intelligence models, the correct answer is generated directly through instantial experience; for general-intelligence models, empirical experience is used to transform the ambiguous instruction into a clear one, and the answer is then generated from it.
- **Response synthesis**: The reasoning process and the original instruction are combined to generate the final answer.

### Experimental Results:

- **Understanding ambiguous instructions**: VISUAL-O1 significantly improves the model's ability to understand ambiguous instructions in multiple tasks (such as image segmentation and visual question answering).
- **Performance on general datasets**: Beyond ambiguous instruction datasets, it also significantly improves the model's performance on general datasets.
- **Generalization ability**: It applies to models with different intelligence levels and to different multimodal tasks, demonstrating broad applicability.

### Conclusion:

With the VISUAL-O1 framework, the paper tackles the problem of understanding ambiguous instructions in multimodal tasks and offers new ideas and methods for applying artificial intelligence in the real world.
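
The control flow described under Method Overview can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Model` callable, the bracketed prompt tags, the fixed number of turns, and the function names `multi_turn_reasoning`, `answer_instantial`, and `answer_empirical` are all assumptions made for illustration, and the image is stood in for by a plain-text description so that the sketch stays self-contained.

```python
from typing import Callable, List

# Hypothetical interface: a "model" is any callable that maps a prompt string
# to a response string. A real multimodal model would also receive the image;
# here the image is replaced by a text description (an assumption).
Model = Callable[[str], str]


def multi_turn_reasoning(model: Model, instruction: str, image_desc: str,
                         turns: int = 3) -> List[str]:
    """Alternate one reasoning step and one reflection step for `turns` rounds."""
    history: List[str] = []
    for _ in range(turns):
        past = " | ".join(history)
        reasoning = model(
            f"[reason] image: {image_desc}; instruction: {instruction}; "
            f"history: {past}"
        )
        reflection = model(f"[reflect] reasoning: {reasoning}")
        history.extend([reasoning, reflection])
    return history


def answer_instantial(model: Model, instruction: str, image_desc: str,
                      turns: int = 3) -> str:
    """Per-instance path: build a reasoning trace for this input, then
    synthesize the answer from the trace plus the original instruction."""
    trace = multi_turn_reasoning(model, instruction, image_desc, turns)
    return model(
        f"[answer] image: {image_desc}; instruction: {instruction}; "
        f"reasoning: {' | '.join(trace)}"
    )


def answer_empirical(model: Model, instruction: str, image_desc: str,
                     experience: str) -> str:
    """Reusable-experience path: use previously collected experience to
    rewrite the ambiguous instruction into a clear one, then answer."""
    clear = model(
        f"[rewrite] experience: {experience}; image: {image_desc}; "
        f"instruction: {instruction}"
    )
    return model(
        f"[answer] image: {image_desc}; instruction: {instruction}; "
        f"clarified: {clear}"
    )
```

Any function with the signature `str -> str` can be passed as `model`, so the flow can be exercised with a stub before wiring in a real model. How the reusable experience string is obtained in the first place is left out here, since the summary does not specify it.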

Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings

Multimodal Chain-of-Thought Reasoning in Language Models

Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model

Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

VISUALCODER: Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning

Chain of Images for Intuitively Reasoning

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models

Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

KM4: Visual Reasoning Via Knowledge Embedding Memory Model with Mutual Modulation

INFORM : Information Entropy Based Multi-Step Reasoning FOR Large Language Models