Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, Shikun Zhang

2024-02-24

Abstract: Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyze the representation distribution of textual and visual tokens in MLLMs, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making it challenging to distinguish them. These two observations inspire a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use text with hallucinations as hard negative examples, naturally bringing representations of non-hallucinated text and visual samples closer while pushing away representations of non-hallucinated and hallucinated text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMHal-Bench benchmark, our method obtains a 34.66%/29.5% improvement over the baseline MiniGPT-4/LLaVA. Our code is available at https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the tendency of Multimodal Large Language Models (MLLMs) to generate erroneous or fabricated text that does not match the visual input, a phenomenon known as "hallucination." By analyzing the representation distribution of textual and visual tokens in MLLMs, the authors identify two main issues:

1. **Modality Gap**: There is a significant gap between textual and visual representations, indicating poor cross-modal representation alignment.
2. **Representation Entanglement**: Representations of text with and without hallucinations are entangled and difficult to distinguish.

According to the authors, these issues make MLLMs prone to hallucination on multimodal tasks, which reduces accuracy and reliability. The paper therefore proposes Hallucination-Augmented Contrastive Learning (HACL), which applies contrastive learning with hallucinated text as hard negative examples. The goal is to pull representations of non-hallucinated text toward the visual representations and push them away from hallucinated text, reducing hallucinations and improving performance across multiple benchmarks.
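
The contrastive idea described above can be illustrated with a small sketch. This is not the authors' implementation: it is a generic InfoNCE-style image-to-text loss in plain Python, where each image's hallucinated caption is appended as an extra hard negative. The function names, the use of cosine similarity, and the temperature value of 0.07 are assumptions made for illustration only.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss_with_hard_negatives(
    image_embs: Sequence[Vector],
    text_embs: Sequence[Vector],
    hallucinated_embs: Sequence[Vector],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Mean InfoNCE-style loss over a batch (hypothetical helper).

    For image i, the positive is text_embs[i]. The negatives are the other
    captions in the batch plus hallucinated_embs[i], a hallucinated caption
    for the same image that acts as a hard negative.
    """
    losses = []
    for i, img in enumerate(image_embs):
        # Similarity logits against every in-batch caption.
        logits = [cosine(img, t) / temperature for t in text_embs]
        # Append the hallucinated caption for this image as a hard negative.
        logits.append(cosine(img, hallucinated_embs[i]) / temperature)
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching (non-hallucinated) caption.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

In this sketch, a hallucinated caption whose embedding lies close to the image embedding contributes a large term to the denominator and raises the loss, so minimizing the loss pushes hallucinated text away from the image while pulling the correct caption toward it.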

Hallucination of Multimodal Large Language Models: A Survey

Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning

Unified Hallucination Detection for Multimodal Large Language Models

Piculet: Specialized Models-Guided Hallucination Decrease for MultiModal Large Language Models

Visual Hallucinations of Multi-modal Large Language Models

Mitigating Multilingual Hallucination in Large Vision-Language Models

Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models

Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models

VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models

Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs

Understanding Multimodal Hallucination with Parameter-Free Representation Alignment

Evaluation and Analysis of Hallucination in Large Vision-Language Models

HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data