Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models

Bingshuai Liu, Chenyang Lyu, Zijun Min, Zhanyu Wang, Jinsong Su, Longyue Wang

2024-03-03

Abstract: The advancement of Large Language Models (LLMs) has brought substantial attention to the Chain of Thought (CoT) approach, primarily due to its ability to enhance the capability of LLMs on complex reasoning tasks. Moreover, the significance of CoT approaches extends to the application of LLMs for multi-modal tasks. However, the selection of optimal CoT demonstration examples in multi-modal reasoning remains less explored for LLMs due to the inherent complexity of multi-modal examples. In this paper, we introduce a novel approach that addresses this challenge by using retrieval mechanisms to dynamically and automatically select demonstration examples based on cross-modal and intra-modal similarities. Furthermore, we employ a Stratified Sampling method of categorising demonstration examples into groups based on their types and then retrieving examples from different groups respectively to promote the diversity of demonstration examples. Through a series of experiments on two popular benchmark datasets: ScienceQA and MathVista, we demonstrate that our approach significantly improves the performance of GPT-4 by 6% on ScienceQA and 12.9% on MathVista, and enhances the performance of GPT-4V on two datasets by 2.7%, substantially improving the performance of the most advanced LLMs and LMMs for complex multi-modal reasoning tasks.

Computation and Language

What problem does this paper attempt to address?

The paper addresses how to select effective Chain of Thought (CoT) demonstration examples for large language models (LLMs) on multi-modal reasoning tasks. Because multi-modal examples are inherently complex, dynamically and automatically choosing demonstrations to guide multi-modal reasoning remains underexplored. The paper proposes a retrieval mechanism that selects demonstrations based on cross-modal and intra-modal similarity, and uses a Stratified Sampling method to increase their diversity: demonstrations are grouped by type, and examples are retrieved from each group. In experiments on two benchmark datasets, ScienceQA and MathVista, the approach improves GPT-4 by 6% on ScienceQA and by 12.9% on MathVista. It also improves GPT-4V by 2.7% on the two datasets, which suggests that the method applies not only to text-only LLMs but also to multi-modal models that take visual input.
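
The retrieval and sampling idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the encoders, the similarity weighting, and the group definitions are not given here, so everything below is an assumption made for illustration. In particular, the sketch assumes that text and image embeddings live in a shared vector space (so that cross-modal cosine similarity is meaningful), that the four similarity terms are weighted equally, and that each demonstration carries a hypothetical `type` field used for grouping. The names `cosine`, `score`, and `retrieve_stratified` are likewise hypothetical, and the vectors are toy values.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score(query, example):
    """Sum of intra-modal and cross-modal similarities.

    Equal weighting of the four terms is an assumption of this sketch.
    """
    intra = cosine(query["text"], example["text"]) + cosine(query["image"], example["image"])
    cross = cosine(query["text"], example["image"]) + cosine(query["image"], example["text"])
    return intra + cross


def retrieve_stratified(query, pool, per_group=1):
    """Group the pool by the (hypothetical) 'type' field, then take the
    top-scoring examples from every group so the demonstrations stay diverse."""
    groups = defaultdict(list)
    for example in pool:
        groups[example["type"]].append(example)

    selected = []
    for group_type in sorted(groups):  # sorted for a deterministic order
        ranked = sorted(groups[group_type], key=lambda ex: score(query, ex), reverse=True)
        selected.extend(ranked[:per_group])
    return selected


# Toy demonstration pool with made-up 2-dimensional embeddings.
pool = [
    {"id": "a", "type": "physics", "text": [1.0, 0.0], "image": [1.0, 0.0]},
    {"id": "b", "type": "physics", "text": [0.0, 1.0], "image": [0.0, 1.0]},
    {"id": "c", "type": "biology", "text": [0.9, 0.1], "image": [0.8, 0.2]},
    {"id": "d", "type": "biology", "text": [0.1, 0.9], "image": [0.0, 1.0]},
]
query = {"text": [1.0, 0.0], "image": [1.0, 0.0]}

demos = retrieve_stratified(query, pool, per_group=1)
ids = [example["id"] for example in demos]
print(ids)
```

Retrieving within each group, rather than taking a global top-k, is what keeps the selected demonstrations from collapsing onto a single type when one group dominates the similarity ranking. The retrieved examples would then be placed in the prompt as CoT demonstrations ahead of the query.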

Multimodal Chain-of-Thought Reasoning in Language Models

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models

Automatic Chain of Thought Prompting in Large Language Models

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance

ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models

Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models

Pattern-Aware Chain-of-Thought Prompting in Large Language Models

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting

Rethinking with Retrieval: Faithful Large Language Model Inference

CoF-CoT: Enhancing Large Language Models with Coarse-to-Fine Chain-of-Thought Prompting for Multi-domain NLU Tasks

AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations

Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs

Self-prompted Chain-of-Thought on Large Language Models for Open-domain Multi-hop Reasoning

Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation

Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding