DOMINO: A Dual-System for Multi-step Visual Language Reasoning

Peifeng Wang, Olga Golovneva, Armen Aghajanyan, Xiang Ren, Muhao Chen, Asli Celikyilmaz, Maryam Fazel-Zarandi
2023-10-04
Abstract: Visual language reasoning requires a system to extract text or numbers from information-dense images like charts or plots and perform logical or arithmetic reasoning to arrive at an answer. To tackle this task, existing work relies on either (1) an end-to-end vision-language model trained on a large amount of data, or (2) a two-stage pipeline where a captioning model converts the image into text that is further read by another large language model to deduce the answer. However, the former approach forces the model to answer a complex question in a single step, and the latter approach is prone to inaccurate or distracting information in the converted text that can confuse the language model. In this work, we propose a dual-system for multi-step multimodal reasoning, which consists of a "System-1" step for visual information extraction and a "System-2" step for deliberate reasoning. Given an input, System-2 breaks down the question into atomic sub-steps, each guiding System-1 to extract the information required for reasoning from the image. Experiments on chart and plot datasets show that our method with a pre-trained System-2 module performs competitively compared to prior work on in- and out-of-distribution data. By fine-tuning the System-2 module (LLaMA-2 70B) on only a small amount of data on multi-step reasoning, the accuracy of our method is further improved and surpasses the best fully-supervised end-to-end approach by 5.7% and a pipeline approach with FlanPaLM (540B) by 7.5% on a challenging dataset with human-authored questions.
Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses multi-step reasoning in visual language reasoning tasks, particularly answering complex questions about charts and plots. Existing methods either use end-to-end vision-language models that extract information from the image and reason over it in one pass, or adopt a two-stage pipeline that first converts the image into text and then passes that text to a large language model for reasoning. The former must answer a complex question in a single step, while the latter is prone to inaccurate or distracting information in the converted text, which can confuse the language model. The paper proposes DOMINO, a dual-system framework for multi-step visual language reasoning. It has two modules: "System-1" extracts visual information from the image, and "System-2" performs deliberate logical and arithmetic reasoning. Given a question and a chart, System-2 decomposes the question into atomic sub-steps, each of which guides System-1 to extract the information required from the image. This design allows more interaction between the two modalities, which improves the handling of complex questions. With a pre-trained System-2 module, DOMINO performs competitively with prior end-to-end and pipeline methods on chart and plot datasets, on both in- and out-of-distribution data. Fine-tuning the System-2 module (LLaMA-2 70B) on a small amount of multi-step reasoning data improves accuracy further: on a challenging dataset with human-authored questions, DOMINO surpasses the best fully supervised end-to-end approach by 5.7% and a pipeline approach with FlanPaLM (540B) by 7.5%. Additionally, the study finds that describing operations is crucial for avoiding hallucinations and helps generate effective queries.
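The interaction between the two modules can be illustrated with a minimal Python sketch. It is not the authors' implementation: `make_system1`, `system2_step`, and `run_dual_system` are hypothetical names, System-1 is replaced by a dictionary lookup standing in for a vision model reading the chart, and System-2 is a scripted planner for one kind of question ("difference between two bars") rather than a language model. Only the control flow, where System-2 alternates between issuing atomic queries and reasoning over the returned values, reflects the summary above.

```python
from typing import Callable, Dict, List, Tuple, Union

Step = Tuple[str, Union[str, float]]


def make_system1(chart: Dict[str, float]) -> Callable[[str], float]:
    """Hypothetical System-1 stub: answers one atomic query about the chart.

    A real System-1 would read the chart image; here a dict plays the chart.
    """
    def system1(query: str) -> float:
        return chart[query]
    return system1


def system2_step(question: Tuple[str, str], observations: List[float]) -> Step:
    """Hypothetical System-2 stub for the question 'value of A minus value of B'.

    Returns ("query", label) to ask System-1 for a value, or
    ("answer", number) once enough values have been collected.
    """
    a, b = question
    if len(observations) == 0:
        return ("query", a)
    if len(observations) == 1:
        return ("query", b)
    return ("answer", observations[0] - observations[1])


def run_dual_system(question: Tuple[str, str],
                    system1: Callable[[str], float],
                    max_steps: int = 5) -> float:
    """Alternate between System-2 planning and System-1 extraction."""
    observations: List[float] = []
    for _ in range(max_steps):
        kind, payload = system2_step(question, observations)
        if kind == "answer":
            return float(payload)
        # System-2 issued an atomic query; System-1 extracts the value.
        observations.append(system1(str(payload)))
    raise RuntimeError("no answer within the step budget")


# Toy chart with made-up values, used only to exercise the loop.
toy_chart = {"2019": 40.0, "2020": 55.0}
result = run_dual_system(("2020", "2019"), make_system1(toy_chart))
print(result)  # 15.0
```

In this sketch each query asks for exactly one value, which mirrors the idea of atomic sub-steps: System-1 never sees the full question, only the piece of information System-2 currently needs.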