Abstract: Multimodal sentiment analysis aims to predict sentiment from multimodal signals such as audio, video, and text. Existing methods often rely on Pre-trained Language Models (PLMs) to extract semantic information from textual data and therefore lack an in-depth understanding of the logical relationships within the text modality. This paper introduces Multimodal PEAR Chain-of-Thought (MM-PEAR-CoT) reasoning for multimodal sentiment analysis. Inspired by the human thought process when solving complex problems, we first propose the PEAR (Preliminaries, quEstion, Answer, Reason) chain-of-thought prompt, which induces Large Language Models (LLMs) to generate text-based reasoning processes and zero-shot sentiment predictions. However, text-based chain-of-thought reasoning is not always reliable and may contain irrational steps caused by LLM hallucinations. To address this, we further design the Cross-Modal Filtering and Fusion (CMFF) module. The filtering submodule uses the audio and visual modalities to suppress irrational steps in the chain of thought, while the fusion submodule integrates high-level reasoning information and cross-modal complementary information during semantic representation learning. Experimental results on two multimodal sentiment analysis benchmark datasets show that high-level reasoning information helps the model learn discriminative text representations, and that cross-modal complementary information keeps unreasonable steps in the chain of thought from misleading the model. MM-PEAR-CoT achieves the best results on both datasets, with improvements of 2.2% and 1.7% in binary classification accuracy on the CMU-MOSI and CMU-MOSEI datasets, respectively. To the best of our knowledge, this is the first study to apply chain-of-thought reasoning to multimodal sentiment analysis.
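
The filtering and fusion idea described in the abstract can be illustrated with a toy sketch. The abstract does not specify how CMFF is implemented, so everything below is an assumption made for illustration only: the sigmoid gate over a scaled dot product, the mean pooling, the concatenation-based fusion, and the names `filter_steps`, `fuse`, and `av_context` are all hypothetical and are not the authors' method.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filter_steps(step_vecs, av_context):
    """Hypothetical filtering: scale each reasoning-step vector by a gate in (0, 1).

    The gate is a sigmoid of the scaled dot product between the step vector and
    an audio-visual context vector, so steps that disagree with the
    audio-visual evidence are suppressed. This gating form is an assumption.
    """
    scale = math.sqrt(len(av_context))
    gates = [sigmoid(dot(s, av_context) / scale) for s in step_vecs]
    filtered = [[g * x for x in s] for g, s in zip(gates, step_vecs)]
    return gates, filtered

def fuse(text_vec, filtered_steps):
    """Hypothetical fusion: concatenate the text vector with the mean filtered step."""
    n = len(filtered_steps)
    mean_step = [sum(col) / n for col in zip(*filtered_steps)]
    return list(text_vec) + mean_step

# Toy inputs: one audio-visual context vector and two chain-of-thought step vectors.
av_context = [1.0, 0.0, 1.0, 0.0]
steps = [
    [2.0, 0.0, 2.0, 0.0],    # step consistent with the audio-visual evidence
    [-2.0, 0.0, -2.0, 0.0],  # step that contradicts it
]
gates, filtered = filter_steps(steps, av_context)
fused = fuse([0.5, 0.5, 0.5, 0.5], filtered)
```

In this toy setup the consistent step receives a gate above 0.5 and the contradicting step a gate below 0.5, so the contradicting step contributes less to the fused representation. A learned model would replace the fixed dot product with trained projections.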

MaTCR: Modality-Aligned Thought Chain Reasoning for Multimodal Task-Oriented Dialogue Generation

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Multimodal Chain-of-Thought Reasoning in Language Models

Multimodal PEAR Chain-of-Thought Reasoning for Multimodal Sentiment Analysis

DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models

Cantor: Inspiring Multimodal Chain-of-Thought of MLLM

MiCEval: Unveiling Multimodal Chain of Thought's Quality via Image Description and Reasoning Steps

Towards a Unified Multimodal Reasoning Framework

Multi-modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models

DMRM: A Dual-Channel Multi-Hop Reasoning Model for Visual Dialog

Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning

MMCTAgent: Multi-modal Critical Thinking Agent Framework for Complex Visual Reasoning

Multimodal Reasoning with Multimodal Knowledge Graph

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning

Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models

Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models

ChatterBox: Multi-round Multimodal Referring and Grounding

Enhancing human-like multimodal reasoning: a new challenging dataset and comprehensive framework

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning