BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models

Xueliang Zhao, Xinting Huang, Tingchen Fu, Qintong Li, Shansan Gong, Lemao Liu, Wei Bi, Lingpeng Kong
2024-02-21
Abstract: Multimodal reasoning stands as a pivotal capability for large vision-language models (LVLMs). The integration with Domain-Specific Languages (DSL), offering precise visual representations, equips these models with the opportunity to execute more accurate reasoning in complex and professional domains. However, the vanilla Chain-of-Thought (CoT) prompting method faces challenges in effectively leveraging the unique strengths of visual and DSL representations, primarily due to their differing reasoning mechanisms. Additionally, it often falls short in addressing critical steps in multi-step reasoning tasks. To mitigate these challenges, we introduce the Bi-Modal Behavioral Alignment (BBA) prompting method, designed to maximize the potential of DSL in augmenting complex multi-modal reasoning tasks. This method initiates by guiding LVLMs to create separate reasoning chains for visual and DSL representations. Subsequently, it aligns these chains by addressing any inconsistencies, thus achieving a cohesive integration of behaviors from different modalities. Our experiments demonstrate that BBA substantially improves the performance of GPT-4V(ision) on geometry problem solving ($28.34\% \to 34.22\%$), chess positional advantage prediction ($42.08\% \to 46.99\%$) and molecular property prediction ($77.47\% \to 83.52\%$).
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses two main challenges that Large Vision-Language Models (LVLMs) face in multimodal reasoning tasks:

1. **Fusion of Different Modal Representations**:
   - With vanilla Chain-of-Thought prompting, LVLMs struggle to exploit the complementary strengths of visual input and Domain-Specific Language (DSL) representations. The two representations rely on different reasoning mechanisms, which leads to inconsistencies or outright conflicts during reasoning.
2. **Identification and Resolution of Key Steps**:
   - The models often fail on the critical steps of multi-step reasoning, which limits their effectiveness on complex problems.

To address these challenges, the authors propose the Bi-Modal Behavioral Alignment (BBA) prompting method. BBA proceeds in two steps:

- **Generation of Independent Reasoning Chains**:
  - First, the LVLM is guided to generate separate reasoning chains, one from the visual representation and one from the DSL representation.
- **Alignment of Reasoning Chains**:
  - Then, the inconsistencies between the two chains are identified and resolved, which integrates the behaviors of the two modalities into one coherent line of reasoning.

### Experimental Results

With GPT-4V(ision), BBA improves performance on three multimodal reasoning tasks:

- **Geometry Problem Solving**: Improved from 28.34% to 34.22%.
- **Chess Positional Advantage Prediction**: Improved from 42.08% to 46.99%.
- **Molecular Property Prediction**: Improved from 77.47% to 83.52%.

### Main Contributions

1. **Effective Fusion of Different Modal Advantages**:
   - BBA employs a "late fusion" strategy, preserving the inherent advantages of direct visual input and DSL representations.
2. **Identification and Resolution of Key Steps**:
   - By exposing the differences between the two reasoning chains, BBA can allocate intermediate tokens more effectively to the critical steps.

### Conclusion

BBA improves the performance of LVLMs on multimodal reasoning tasks by generating reasoning chains independently for each modality and then resolving the inconsistencies between them. The reported gains cover geometry problem solving, chess positional advantage prediction, and molecular property prediction.
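The two-step procedure described above (independent reasoning chains, then alignment) can be sketched as a small prompting pipeline. This is only an illustrative sketch, not the authors' implementation: the `LVLM` callable, the function name `bba_answer`, and all prompt wording are hypothetical, and whether the alignment call should also receive the image is an assumption made here for simplicity.

```python
from typing import Callable, Optional

# Hypothetical model interface (an assumption, not a real library API):
# takes a text prompt and an optional image reference, returns model text.
LVLM = Callable[[str, Optional[str]], str]


def bba_answer(question: str, image: str, dsl: str, lvlm: LVLM) -> str:
    """Sketch of BBA-style prompting: two independent chains, then alignment."""
    # Step 1a: reasoning chain elicited from the visual input only.
    vision_chain = lvlm(
        f"Question: {question}\nReason step by step using the image.",
        image,
    )
    # Step 1b: reasoning chain elicited from the DSL representation only.
    dsl_chain = lvlm(
        f"Question: {question}\nDSL representation:\n{dsl}\n"
        "Reason step by step using only the DSL.",
        None,
    )
    # Step 2: late fusion. The model sees both chains, is asked to locate
    # where they disagree, resolve each disagreement, and give one answer.
    # Passing the image again here is an assumption of this sketch.
    align_prompt = (
        f"Question: {question}\n"
        f"Chain from the image:\n{vision_chain}\n"
        f"Chain from the DSL:\n{dsl_chain}\n"
        "Identify any inconsistencies between the two chains, resolve them, "
        "and state the final answer."
    )
    return lvlm(align_prompt, image)


if __name__ == "__main__":
    # Stub model for a dry run; a real LVLM client would replace it.
    def echo_model(prompt: str, image: Optional[str]) -> str:
        return f"[{len(prompt)} prompt chars, image={image}]"

    print(bba_answer("Which side is better?", "board.png", "<DSL text>", echo_model))
```

Keeping the model behind a plain callable means the same pipeline can be exercised with a stub, as in the dry run, before any real model client is wired in.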