Compositional Chain-of-Thought Prompting for Large Multimodal Models

Chancharik Mitra, Brandon Huang, Trevor Darrell, Roei Herzig

2024-04-01

Abstract: The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs)--a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, scene graph data requires scene graph annotations, which are expensive to collect and thus not easily scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs. Code:

Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper addresses the weakness of large multimodal models (LMMs) in compositional visual reasoning. Although state-of-the-art LMMs perform well on a wide range of vision and language (VL) tasks, they still struggle to capture compositional aspects of a scene, such as object attributes and the relationships between objects. The paper proposes Compositional Chain-of-Thought (CCoT), a zero-shot prompting method that works in two steps: the LMM first generates a scene graph (SG) for the image, and that SG is then included in the prompt used to produce the final response. Because CCoT is a prompting method, it requires neither fine-tuning nor annotated ground-truth scene graphs, and it avoids the catastrophic forgetting that fine-tuning on SG data can cause. Experiments show that CCoT improves LMM performance on several compositional VL benchmarks and also improves several popular LMMs on general multimodal benchmarks. In summary, the main contribution is a zero-shot chain-of-thought prompting method that improves compositional visual understanding in LMMs without additional training or annotation.
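The two-step procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LMM` callable interface, the function `ccot_answer`, the stub model, and the wording of both prompt templates are assumptions made for this example and are not taken from the paper.

```python
import json
from typing import Callable, Tuple

# Hypothetical interface (an assumption): any function that takes an image
# reference and a text prompt and returns the model's text reply.
LMM = Callable[[str, str], str]

# Illustrative prompt wording only; not the paper's exact prompts.
SG_PROMPT = (
    "For the provided image and question, generate a scene graph in JSON "
    "format that lists the relevant objects, their attributes, and the "
    "relationships between them.\n"
    "Question: {question}"
)
ANSWER_PROMPT = (
    "Scene graph:\n{scene_graph}\n\n"
    "Use the image and the scene graph above as context to answer.\n"
    "Question: {question}"
)


def ccot_answer(lmm: LMM, image: str, question: str) -> Tuple[str, str]:
    """Run two zero-shot calls: generate a scene graph, then answer with it."""
    # Step 1: the LMM itself produces the scene graph (no annotations needed).
    scene_graph = lmm(image, SG_PROMPT.format(question=question))
    # Step 2: the generated scene graph is placed in the prompt for the answer.
    answer = lmm(
        image,
        ANSWER_PROMPT.format(scene_graph=scene_graph, question=question),
    )
    return scene_graph, answer


def stub_lmm(image: str, prompt: str) -> str:
    """Stand-in model for demonstration; returns canned text, no inference."""
    if "Scene graph:" in prompt:
        return "The spoon is left of the cup."
    return json.dumps({
        "objects": ["cup", "spoon"],
        "attributes": {"cup": ["white"]},
        "relationships": [["spoon", "left of", "cup"]],
    })


sg, ans = ccot_answer(stub_lmm, "kitchen.jpg", "What is left of the cup?")
print(sg)
print(ans)
```

Passing the model as a plain callable keeps the sketch independent of any particular LMM API; in practice `stub_lmm` would be replaced by a wrapper around a real model, and no model weights are updated at any point.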

CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs

Interleaved-Modal Chain-of-Thought

Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models

Prompting Large Vision-Language Models for Compositional Reasoning

Multimodal Chain-of-Thought Reasoning in Language Models

ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

CoF-CoT: Enhancing Large Language Models with Coarse-to-Fine Chain-of-Thought Prompting for Multi-domain NLU Tasks

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings

Supervised Chain of Thought

Constrained Reasoning Chains for Enhancing Theory-of-Mind in Large Language Models

AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations

Self-prompted Chain-of-Thought on Large Language Models for Open-domain Multi-hop Reasoning

Chain-of-Symbol Prompting Elicits Planning in Large Language Models

Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models