Abstract: Visual language reasoning requires a system to extract text or numbers from information-dense images like charts or plots and perform logical or arithmetic reasoning to arrive at an answer. To tackle this task, existing work relies on either (1) an end-to-end vision-language model trained on a large amount of data, or (2) a two-stage pipeline where a captioning model converts the image into text that is further read by another large language model to deduce the answer. However, the former approach forces the model to answer a complex question with one single step, and the latter approach is prone to inaccurate or distracting information in the converted text that can confuse the language model. In this work, we propose a dual-system for multi-step multimodal reasoning, which consists of a "System-1" step for visual information extraction and a "System-2" step for deliberate reasoning. Given an input, System-2 breaks down the question into atomic sub-steps, each guiding System-1 to extract the information required for reasoning from the image. Experiments on chart and plot datasets show that our method with a pre-trained System-2 module performs competitively compared to prior work on in- and out-of-distribution data. By fine-tuning the System-2 module (LLaMA-2 70B) on only a small amount of data on multi-step reasoning, the accuracy of our method is further improved and surpasses the best fully-supervised end-to-end approach by 5.7% and a pipeline approach with FlanPaLM (540B) by 7.5% on a challenging dataset with human-authored questions.

What problem does this paper attempt to address?

The paper addresses multi-step reasoning in visual language tasks, in particular answering complex questions about charts and plots. Existing methods either use end-to-end models that extract information from the image and reason over it in a single pass, or adopt a two-stage pipeline that first converts the image into a textual table and then passes it to a large language model for reasoning. The former must answer a complex question in one step, while the latter can be misled by information that is lost or distorted during the conversion.

The paper proposes DOMINO, a dual-system framework for multi-step visual language reasoning. It has two modules: "System-1" extracts visual information from the image, and "System-2" performs deliberate logical reasoning. Given a question and a chart, System-2 decomposes the question into a series of sub-tasks, each of which guides System-1 to extract the required information from the image. This lets the two modalities interact more closely, which improves the handling of complex questions.

Experimental results show that DOMINO outperforms existing end-to-end models and pipeline methods on multiple datasets, especially those that require more reasoning steps. Fine-tuning the System-2 module (LLaMA-2 70B) improves performance further, in some cases surpassing fully supervised methods. The study also finds that describing operations is crucial for avoiding hallucinations and helps generate effective queries.
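The interaction between the two modules can be illustrated with a minimal Python sketch. It is not the paper's implementation. The chart is a plain dictionary standing in for an image, `system1_extract` stands in for a vision model, and `system2_next_step` is a hard-coded rule standing in for the language model. All names here (`Query`, `Answer`, `answer_question`, `max_steps`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Union

# Hypothetical stand-in for a chart image: series label -> plotted value.
Chart = Dict[str, float]


@dataclass
class Query:
    """A sub-step in which System-2 asks System-1 for one value."""
    label: str


@dataclass
class Answer:
    """The final step, carrying the computed answer."""
    value: float


Step = Union[Query, Answer]


def system1_extract(chart: Chart, query: Query) -> float:
    """Toy System-1: a real system would run a vision model on the image."""
    return chart[query.label]


def system2_next_step(question: str, facts: Dict[str, float]) -> Step:
    """Toy System-2: a real system would have a language model pick the step.

    This rule is hard-coded for one kind of question (growth from 2019
    to 2020) and ignores the question text.
    """
    for label in ("2019", "2020"):
        if label not in facts:
            return Query(label)  # still missing a value, so ask System-1
    return Answer(facts["2020"] - facts["2019"])


def answer_question(chart: Chart, question: str, max_steps: int = 10) -> float:
    """Alternate between System-2 (decide) and System-1 (extract)."""
    facts: Dict[str, float] = {}
    for _ in range(max_steps):
        step = system2_next_step(question, facts)
        if isinstance(step, Answer):
            return step.value
        facts[step.label] = system1_extract(chart, step)
    raise RuntimeError("no answer produced within max_steps")


chart = {"2019": 12.0, "2020": 17.5}
result = answer_question(chart, "How much did the value grow from 2019 to 2020?")
print(result)
```

The point of the loop is that each extraction is requested only when the reasoning step needs it, so System-1 never has to transcribe the whole chart up front.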

DOMINO: A Dual-System for Multi-step Visual Language Reasoning
