Abstract: Recent studies have shown that Vision Language Large Models (VLLMs) may output content not relevant to the input images. This problem, called the hallucination phenomenon, degrades VLLM performance. Therefore, various anti-hallucination techniques have been proposed to make model output more reasonable and accurate. Despite their successes, from extensive tests we found that augmenting the prompt (e.g., word appending, rewriting, and spelling errors) may change model output and make the output hallucinate again. To cure this drawback, we propose a new instruct-tuning framework called Prompt Augmentation and Caption Utilization (PACU) to boost VLLM's generation ability under the augmented prompt scenario. Concretely, on the one hand, PACU exploits existing LLMs to augment and evaluate diverse prompts automatically. The resulting high-quality prompts are utilized to enhance VLLM's ability to process different prompts. On the other hand, PACU exploits image captions to jointly work with image features as well as the prompts for response generation. When the visual feature is inaccurate, the LLM can capture useful information from the image captions for response generation. Extensive experiments on hallucination evaluation and prompt-augmented datasets demonstrate that our PACU method can work well with existing schemes to effectively boost VLLM model performance. Code is available at https://github.com/zhaominyiz/PACU.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the hallucination phenomenon that occurs in Vision Language Large Models (VLLMs) when handling augmented prompts. Specifically, some existing VLLMs may produce erroneous or irrelevant content when faced with augmented prompts (such as adding words, rewriting, etc.), which severely limits the performance and application of these models.

### Solution

To overcome this issue, the authors propose a new framework called **Prompt Augmentation and Caption Utilization (PACU)**. PACU enhances the generative capabilities of VLLMs through the following two aspects:

1. **Prompt Augmentation**:
   - Automatically generate and evaluate diverse prompts using existing large language models (LLMs).
   - Generate high-quality prompts through an automatic prompt augmentation module and score these prompts through a prompt evaluation module, retaining high-quality prompts for model training.
2. **Caption Utilization**:
   - Utilize image captions as prior knowledge to assist in response generation.
   - When visual features are inaccurate, LLMs can capture useful information from captions to generate more accurate responses.

### Experimental Validation

The authors extensively validated the effectiveness of PACU through experiments. The results show that PACU not only improves the performance of VLLMs in handling augmented prompts but also enhances the accuracy of the models under the original unaugmented prompt settings. Additionally, PACU can work well with various VLLM frameworks and LLMs, demonstrating generality and flexibility.

### Main Contributions

1. **Introducing Prompt Augmentation to VLLMs**: Through extensive testing, it was found that existing VLLMs have significant deficiencies in handling augmented prompts.
2. **Proposing the PACU Framework**: By generating, evaluating, and reweighting diverse augmented prompts, and designing a caption utilization generation mechanism, the prompt handling capability of VLLMs is enhanced.
3. **Experimental Validation**: Experimental results on multiple hallucination evaluation datasets demonstrate the effectiveness and advantages of the PACU method.

### Related Work

- **Vision Language Large Models (VLLMs)**: Combine large language models and visual modalities to perform various vision-language understanding tasks.
- **Anti-Hallucination Techniques**: Many studies aim to detect and eliminate hallucination phenomena in VLLMs, but these techniques still have shortcomings in handling augmented prompts.

### Methodology

The PACU framework includes a prompt augmentation module and a caption utilization generation mechanism. The prompt augmentation module generates diverse prompts and screens high-quality prompts through an evaluation module. The caption utilization generation mechanism uses image captions when generating responses to ensure more accurate responses.

### Performance Evaluation

Experimental results show that PACU significantly improves the performance of VLLMs on multiple benchmark datasets, especially in handling augmented prompts. Visualization results further demonstrate the advantages of PACU, particularly in dealing with complex and diverse prompts.

### Conclusion

As a plug-in framework, PACU can work in conjunction with existing anti-hallucination solutions, VLLM frameworks, and LLMs, effectively enhancing the prompt handling capability and generation quality of VLLMs.
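
### Illustrative Sketch

The two ideas summarized above (augment and filter prompts, then pair the prompt with an image caption) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name, the text template, the threshold, and the word-overlap scorer are hypothetical. In PACU the rewriting and scoring are done by existing LLMs; here `rewrite` and `score` are plain callables so the sketch runs without any model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredPrompt:
    text: str
    score: float


def append_words(prompt: str, suffix: str = " Please answer carefully.") -> str:
    """Word appending: add extra words without changing the question."""
    return prompt + suffix


def inject_typo(prompt: str, rng: random.Random) -> str:
    """Spelling error: swap two adjacent inner characters of one word."""
    words = prompt.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return prompt
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def toy_overlap_score(original: str, candidate: str) -> float:
    """Hypothetical stand-in for an LLM-based prompt evaluator:
    Jaccard overlap between the two prompts' word sets."""
    a = set(original.lower().split())
    b = set(candidate.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def augment_and_filter(
    prompt: str,
    rewrite: Callable[[str], str],       # assumed: an LLM paraphraser
    score: Callable[[str, str], float],  # assumed: an LLM quality scorer
    threshold: float = 0.5,              # assumed cut-off, not from the paper
    seed: int = 0,
) -> List[ScoredPrompt]:
    """Generate augmented prompts and keep only those scored as high quality."""
    rng = random.Random(seed)
    candidates = [append_words(prompt), inject_typo(prompt, rng), rewrite(prompt)]
    scored = [ScoredPrompt(c, score(prompt, c)) for c in candidates]
    return [s for s in scored if s.score >= threshold]


def build_caption_input(prompt: str, caption: str) -> str:
    """Caption utilization: place the image caption next to the prompt so the
    LLM can fall back on it when visual features are inaccurate.
    The template is an assumption made for illustration."""
    return f"Image caption: {caption}\nQuestion: {prompt}"
```

A caller would pass its own LLM-backed `rewrite` and `score` functions, train on the retained prompts, and feed `build_caption_input(...)` to the language model alongside the image features.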

Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization

Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation

Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs

Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding

Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites

A Unified Hallucination Mitigation Framework for Large Vision-Language Models

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models

Mitigating Hallucination in Visual-Language Models via Re-Balancing Contrastive Decoding

Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training

Magnifier Prompt: Tackling Multimodal Hallucination via Extremely Simple Instructions

VaLiD: Mitigating the Hallucination of Large Vision Language Models by Visual Layer Fusion Contrastive Decoding

Pensieve: Retrospect-then-Compare Mitigates Visual Hallucination

Mitigating Multilingual Hallucination in Large Vision-Language Models

Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding