Abstract: Chain of thought (CoT) has proven useful for problems requiring complex reasoning. Many of these problems are both textual and multimodal. Given the inputs in different modalities, a model generates a rationale and then uses it to answer a question. Because of the hallucination issue, the generated soft negative rationales with high textual quality but illogical semantics do not always help improve answer accuracy. This study proposes a rationale generation method using soft negative sampling (SNSE-CoT) to mitigate hallucinations in multimodal CoT. Five methods were applied to generate soft negative samples that shared highly similar text but had different semantics from the original. Bidirectional margin loss (BML) was applied to introduce them into the traditional contrastive learning framework that involves only positive and negative samples. Extensive experiments on the ScienceQA dataset demonstrated the effectiveness of the proposed method. Code and data are released at

What problem does this paper attempt to address?

This paper addresses hallucination in the rationales generated by multimodal Chain of Thought (CoT) models. When a model handles tasks that require complex reasoning over inputs in different modalities (such as text and images), the generated rationale can be fluent and well formed as text while its logical semantics are wrong, which lowers the accuracy of the final answer. The paper proposes Soft Negative Sampling (SNS) to improve the semantic correctness of multimodal CoT rationales, reducing hallucination and improving answer accuracy.

### Main contributions of the paper

1. **Introduction of soft negative samples**: The paper proposes five methods for generating soft negative samples. These samples are textually very similar to the original samples but differ in meaning, and they are used within a contrastive learning framework to sharpen the model's ability to discriminate between correct and incorrect rationales.
2. **Bidirectional margin loss**: To bring soft negative samples into the traditional contrastive learning framework, the paper uses Bidirectional Margin Loss (BML), which constrains the semantic difference between positive samples and soft negative samples.
3. **Experimental verification**: Extensive experiments on the ScienceQA dataset demonstrate the effectiveness of the proposed method. The gain is largest on questions with paired images, where the summary reports it as the first method to exceed 90% accuracy.

### Specific methods to solve the problem

- **Soft negative sample generation**: Soft negative samples are generated by five methods: affirmative-negative conversion, number conversion, direction conversion, unit conversion, and option conversion.
- **Bidirectional margin loss**: The difference in cosine similarity between positive samples and soft negative samples is computed, and BML constrains this difference so that the model can tell the two apart.
- **Training objective**: The training objective combines negative log-likelihood loss with the BML loss, balancing the ability to generate high-quality rationales against the ability to produce correct answers.

### Experimental results

- **Performance improvement**: Compared with existing multimodal CoT methods, SNSE-CoT improves performance across multiple task categories, with an average improvement of about 2.5 to 3%.
- **Advantages in specific tasks**: SNSE-CoT performs particularly well on questions with paired images, where it is reported as the first method to exceed 90% accuracy.

### Conclusion

By introducing soft negative samples and bidirectional margin loss, the paper alleviates the hallucination problem in multimodal CoT and improves both the semantic correctness of the rationale and the accuracy of the answer. The experimental results show that the method performs well across multiple task categories, with the clearest advantage on complex multimodal reasoning tasks.
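The five soft negative conversions listed above can be illustrated with a small rule-based sketch. The summary does not say how the paper implements each conversion, so everything here is an assumption made for illustration: the function names, the swap tables, and the regular-expression rules are hypothetical and cover only trivial cases.

```python
import re

# Hypothetical swap tables; the paper's actual vocabularies are not given here.
DIRECTION_SWAPS = {"north": "south", "south": "north", "east": "west",
                   "west": "east", "left": "right", "right": "left"}
UNIT_SWAPS = {"meters": "kilometers", "kilometers": "meters",
              "grams": "kilograms", "kilograms": "grams"}


def _swap_words(text, table):
    """Replace each whole word found in `table` with its counterpart."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


def negate(text):
    """Affirmative-negative conversion: toggle the first 'is' / 'is not'."""
    if re.search(r"\bis not\b", text):
        return re.sub(r"\bis not\b", "is", text, count=1)
    return re.sub(r"\bis\b", "is not", text, count=1)


def change_number(text):
    """Number conversion: increment the first integer in the text."""
    return re.sub(r"\d+", lambda m: str(int(m.group()) + 1), text, count=1)


def change_direction(text):
    """Direction conversion: swap direction words for their opposites."""
    return _swap_words(text, DIRECTION_SWAPS)


def change_unit(text):
    """Unit conversion: swap a unit for a different unit of the same kind."""
    return _swap_words(text, UNIT_SWAPS)


def change_option(text, options, answer):
    """Option conversion: replace the correct option with another option."""
    wrong = next(o for o in options if o != answer)
    return text.replace(answer, wrong)


rationale = "The car moved 5 meters north."
soft_negatives = {
    "number": change_number(rationale),
    "direction": change_direction(rationale),
    "unit": change_unit(rationale),
}
```

Each variant differs from the original rationale by a single token, so the text stays highly similar while the meaning changes, which is the property that defines a soft negative sample.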
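The bidirectional margin loss and the combined training objective described above can be sketched in a few lines. The summary gives neither the exact formula nor the margin values, so this sketch assumes one common two-sided hinge form: with `delta = cos(anchor, soft_negative) - cos(anchor, positive)`, the loss is zero only when `-beta <= delta <= -alpha`. The margins `alpha` and `beta`, the weighting factor `weight`, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bidirectional_margin_loss(anchor, positive, soft_negative,
                              alpha=0.1, beta=0.3):
    """Two-sided hinge on the similarity gap (assumed form, see lead-in).

    The first term penalizes a soft negative that is nearly as close to the
    anchor as the positive is; the second penalizes one pushed too far away.
    """
    delta = cosine(anchor, soft_negative) - cosine(anchor, positive)
    return max(0.0, delta + alpha) + max(0.0, -delta - beta)


def total_loss(nll, bml, weight=1.0):
    """Combine negative log-likelihood with BML; `weight` is an assumption."""
    return nll + weight * bml


anchor = [1.0, 0.0]
positive = [1.0, 0.0]
too_close = bidirectional_margin_loss(anchor, positive, [1.0, 0.0])
in_band = bidirectional_margin_loss(anchor, positive, [0.8, 0.6])
too_far = bidirectional_margin_loss(anchor, positive, [0.0, 1.0])
```

In this toy example the loss is nonzero when the soft negative is identical to the positive (`too_close`) or orthogonal to the anchor (`too_far`), and zero when the similarity gap falls inside the margin band (`in_band`). The upper bound is what distinguishes soft negatives from ordinary negatives, which are simply pushed away.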

Enhancing Semantics in Multimodal Chain of Thought via Soft Negative Sampling
