Enhancing Semantics in Multimodal Chain of Thought via Soft Negative Sampling

Guangmin Zheng, Jin Wang, Xiaobing Zhou, Xuejie Zhang
2024-05-16
Abstract: Chain of thought (CoT) has proven useful for problems requiring complex reasoning. Many of these problems are both textual and multimodal. Given the inputs in different modalities, a model generates a rationale and then uses it to answer a question. Because of the hallucination issue, the generated soft negative rationales with high textual quality but illogical semantics do not always help improve answer accuracy. This study proposes a rationale generation method using soft negative sampling (SNSE-CoT) to mitigate hallucinations in multimodal CoT. Five methods were applied to generate soft negative samples that shared highly similar text but had different semantics from the original. Bidirectional margin loss (BML) was applied to introduce them into the traditional contrastive learning framework that involves only positive and negative samples. Extensive experiments on the ScienceQA dataset demonstrated the effectiveness of the proposed method. Code and data are released at
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses hallucination in rationale generation for multimodal chain of thought (CoT). When a model handles tasks that require complex reasoning over inputs in different modalities (such as text and images), the rationale it generates can be high in textual quality yet logically or semantically wrong, which lowers the accuracy of the final answer. To address this, the paper proposes a method based on soft negative sampling (SNSE-CoT) that improves the semantic correctness of multimodal CoT rationales, thereby mitigating hallucination and improving answer accuracy.

### Main contributions of the paper

1. **Introduction of soft negative samples**: The paper proposes five methods for generating soft negative samples. These samples are highly similar to the original samples in text but differ in semantics, and they are used within a contrastive learning framework to sharpen the model's ability to discriminate between rationales.
2. **Bidirectional margin loss**: To bring soft negative samples into the traditional contrastive learning framework, which involves only positive and negative samples, the paper applies bidirectional margin loss (BML) to constrain the semantic difference between positive samples and soft negative samples.
3. **Experimental verification**: Extensive experiments on the ScienceQA dataset demonstrate the effectiveness of the proposed method. Performance improves significantly on questions with paired images, where the method is reported to be the first to exceed 90% accuracy.

### Specific methods to solve the problem

- **Soft negative sample generation**: Soft negative samples are generated by five methods: affirmative-negative conversion, number conversion, direction conversion, unit conversion, and option conversion.
- **Bidirectional margin loss**: The difference in cosine similarity between positive samples and soft negative samples is computed, and BML constrains this difference so that the model can tell the two apart.
- **Training objective**: The negative log-likelihood loss and the BML loss are combined into a single training objective that balances the generation of high-quality rationales against answer correctness.

### Experimental results

- **Performance improvement**: Compared with existing multimodal CoT methods, SNSE-CoT achieves significant improvements across multiple task categories, with an average improvement of about 2.5 to 3%.
- **Advantages in specific tasks**: SNSE-CoT is particularly strong on questions with paired images, where it is reported to be the first method to exceed 90% accuracy.

### Conclusion

By introducing soft negative samples and bidirectional margin loss, the paper mitigates hallucination in multimodal CoT and improves both the semantic correctness of the generated rationales and the accuracy of the answers. The experimental results show that the proposed method performs well across multiple task categories and is especially effective on complex multimodal reasoning tasks.
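To make the idea of a soft negative sample concrete, the sketch below shows how three of the five conversions named in this summary (affirmative-negative, number, and direction conversion) might be applied to a rationale sentence. The summary does not describe how the paper implements them, so the function names, the regular expressions, and the word list are all illustrative assumptions and not the authors' code.

```python
import re

# Hypothetical word pairs for direction conversion (an assumption, not from the paper).
DIRECTION_SWAPS = {
    "north": "south", "south": "north",
    "east": "west", "west": "east",
    "left": "right", "right": "left",
}
_DIRECTION_RE = re.compile(r"\b(" + "|".join(DIRECTION_SWAPS) + r")\b", re.IGNORECASE)


def negate(text: str) -> str:
    """Affirmative-negative conversion: add 'not' after the first copula."""
    return re.sub(r"\b(is|are|was|were)\b(?! not)", r"\1 not", text, count=1)


def perturb_number(text: str, offset: int = 1) -> str:
    """Number conversion: shift the first integer in the text by `offset`."""
    return re.sub(r"\d+", lambda m: str(int(m.group(0)) + offset), text, count=1)


def flip_direction(text: str) -> str:
    """Direction conversion: swap each direction word with its opposite."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        flipped = DIRECTION_SWAPS[word.lower()]
        # Preserve an initial capital letter if the original word had one.
        return flipped.capitalize() if word[0].isupper() else flipped
    return _DIRECTION_RE.sub(swap, text)
```

Each function changes only a word or two, so the output stays close to the input in surface text while its meaning changes, which is the property the summary attributes to soft negative samples.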
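The bidirectional margin loss and the combined training objective described in this summary can be sketched as follows. The summary gives no formula, so this assumes a two-sided hinge on the cosine-similarity difference, a form used for soft negatives in contrastive sentence-embedding work. The margins `alpha` and `beta`, their default values, and the weighting factor `weight` are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bidirectional_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    soft_negative: Sequence[float],
    alpha: float = 0.1,  # assumed lower margin
    beta: float = 0.3,   # assumed upper margin
) -> float:
    """Two-sided hinge that is zero when -beta <= delta <= -alpha.

    delta is the soft negative's similarity to the anchor minus the
    positive's similarity to the anchor.
    """
    delta = cosine(anchor, soft_negative) - cosine(anchor, positive)
    too_close = max(0.0, delta + alpha)   # soft negative not separated enough
    too_far = max(0.0, -delta - beta)     # soft negative pushed away too much
    return too_close + too_far


def total_loss(nll: float, bml: float, weight: float = 1.0) -> float:
    """Combined objective: negative log-likelihood plus weighted BML."""
    return nll + weight * bml
```

Under these assumptions, the loss penalizes a soft negative that is as similar to the anchor as the positive is, and also one that has been pushed as far away as an ordinary negative. This reflects the intent that soft negatives remain textually close while still being distinguishable from the positive.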