Abstract: Multimodal reasoning stands as a pivotal capability for large vision-language models (LVLMs). The integration with Domain-Specific Languages (DSL), offering precise visual representations, equips these models with the opportunity to execute more accurate reasoning in complex and professional domains. However, the vanilla Chain-of-Thought (CoT) prompting method faces challenges in effectively leveraging the unique strengths of visual and DSL representations, primarily due to their differing reasoning mechanisms. Additionally, it often falls short in addressing critical steps in multi-step reasoning tasks. To mitigate these challenges, we introduce the Bi-Modal Behavioral Alignment (BBA) prompting method, designed to maximize the potential of DSL in augmenting complex multi-modal reasoning tasks. This method initiates by guiding LVLMs to create separate reasoning chains for visual and DSL representations. Subsequently, it aligns these chains by addressing any inconsistencies, thus achieving a cohesive integration of behaviors from different modalities. Our experiments demonstrate that BBA substantially improves the performance of GPT-4V(ision) on geometry problem solving ($28.34\% \to 34.22\%$), chess positional advantage prediction ($42.08\% \to 46.99\%$) and molecular property prediction ($77.47\% \to 83.52\%$).
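
To put the reported numbers in perspective, the short sketch below computes the absolute and relative gains from the three accuracy pairs quoted in the abstract. The function name `gains` and the dictionary layout are illustrative choices, not part of the paper.

```python
# Accuracies (%) quoted in the abstract for GPT-4V(ision): (before, with BBA).
reported = {
    "geometry problem solving": (28.34, 34.22),
    "chess positional advantage prediction": (42.08, 46.99),
    "molecular property prediction": (77.47, 83.52),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


for task, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{task}: +{absolute:.2f} points ({relative:.1f}% relative)")
```

The absolute gains are similar across the three tasks (roughly 5 to 6 percentage points), while the relative gain is largest on geometry, where the starting accuracy is lowest.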

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses two main challenges that Large Vision-Language Models (LVLMs) face in multimodal reasoning tasks:

1. **Fusion of Different Modal Representations**:
   - LVLMs struggle to leverage the distinct advantages of visual input and Domain-Specific Language (DSL) representations when handling complex multimodal reasoning tasks. The reasoning mechanisms of the two representations differ, which leads to inconsistencies or even conflicts during the reasoning process.
2. **Identification and Resolution of Key Steps**:
   - These models have difficulty with multi-step reasoning, particularly with the key steps required to solve complex problems. This limits their effectiveness in practical applications.

To address these challenges, the authors propose the Bi-Modal Behavioral Alignment (BBA) prompting method. BBA proceeds in two steps:

- **Generation of Independent Reasoning Chains**:
  - First, the LVLM is guided to generate separate reasoning chains from the visual representation and from the DSL representation.
- **Alignment of Reasoning Chains**:
  - Then, inconsistencies between the two chains are identified and resolved, which aligns the behaviors of the two modalities.

### Experimental Results

BBA improves the performance of GPT-4V(ision) on three multimodal reasoning tasks:

- **Geometry Problem Solving**: Improved from 28.34% to 34.22%.
- **Chess Positional Advantage Prediction**: Improved from 42.08% to 46.99%.
- **Molecular Property Prediction**: Improved from 77.47% to 83.52%.

### Main Contributions

1. **Effective Fusion of Different Modal Advantages**:
   - BBA employs a "late fusion" strategy, preserving the inherent advantages of direct visual input and DSL representations.
2. **Identification and Resolution of Key Steps**:
   - By revealing differences between the reasoning chains, BBA can allocate intermediate tokens more effectively and therefore handle key steps better.

### Conclusion

BBA enhances the performance of LVLMs in multimodal reasoning tasks by generating reasoning chains independently for each modality and then resolving their inconsistencies. The improvements are demonstrated on geometry problem solving, chess positional advantage prediction, and molecular property prediction.
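
The two-stage procedure described in this summary (independent chains, then alignment) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LVLM` callable interface, the function `bba_answer`, the prompt wording, and the choice to pass the image again in the alignment step are all assumptions made for the example.

```python
from typing import Callable, Optional

# Hypothetical model interface (an assumption): (prompt, optional image bytes) -> text.
LVLM = Callable[[str, Optional[bytes]], str]


def bba_answer(model: LVLM, question: str, image: bytes, dsl: str) -> str:
    """Two-stage BBA-style prompting; the prompt wording is illustrative only."""
    # Stage 1a: reasoning chain elicited from the visual input alone.
    vision_chain = model(
        f"Look at the image and reason step by step.\nQuestion: {question}",
        image,
    )
    # Stage 1b: reasoning chain elicited from the DSL representation alone.
    dsl_chain = model(
        "Reason step by step using only this DSL description.\n"
        f"DSL:\n{dsl}\nQuestion: {question}",
        None,
    )
    # Stage 2: ask the model to find and resolve inconsistencies between chains.
    align_prompt = (
        "Two reasoning chains for the same question are given below.\n"
        f"Question: {question}\n\n"
        f"Chain from the image:\n{vision_chain}\n\n"
        f"Chain from the DSL:\n{dsl_chain}\n\n"
        "Identify where the chains disagree, resolve each disagreement, "
        "and give a final answer."
    )
    # Passing the image again here is an assumption of this sketch.
    return model(align_prompt, image)


def stub_model(prompt: str, image: Optional[bytes]) -> str:
    """Stand-in for a real LVLM call so the sketch runs without any API."""
    source = "image" if image is not None else "text"
    return f"<{source}-based response to a {len(prompt)}-character prompt>"


if __name__ == "__main__":
    print(bba_answer(stub_model, "Is triangle ABC isosceles?", b"\x89PNG", "AB = AC"))
```

Keeping the two chains separate until the final call mirrors the "late fusion" idea: each representation is reasoned over on its own terms, and only the disagreements are examined jointly.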

BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
