Abstract: Recent studies have shown that Vision Language Large Models (VLLMs) may output content not relevant to the input images. This problem, called the hallucination phenomenon, undoubtedly degrades VLLM performance. Therefore, various anti-hallucination techniques have been proposed to make model output more reasonable and accurate. Despite their successes, we found from extensive tests that augmenting the prompt (e.g., word appending, rewriting, and spelling errors) may change the model output and make it hallucinate again. To address this drawback, we propose a new instruct-tuning framework called Prompt Augmentation and Caption Utilization (PACU) to boost VLLM's generation ability under the augmented prompt scenario. Concretely, on the one hand, PACU exploits existing LLMs to augment and evaluate diverse prompts automatically. The resulting high-quality prompts are utilized to enhance VLLM's ability to process different prompts. On the other hand, PACU exploits image captions to jointly work with image features as well as the prompts for response generation. When the visual feature is inaccurate, the LLM can capture useful information from the image captions for response generation. Extensive experiments on hallucination evaluation and prompt-augmented datasets demonstrate that our PACU method can work well with existing schemes to effectively boost VLLM performance. Code is available at https://github.com/zhaominyiz/PACU.
What problem does this paper attempt to address?
### Problems Addressed by the Paper
This paper addresses the hallucination phenomenon that occurs in Vision Language Large Models (VLLMs) when they handle augmented prompts. Specifically, existing VLLMs, even when combined with anti-hallucination techniques, may produce erroneous or image-irrelevant content once the prompt is augmented (for example by appending words, rewriting, or introducing spelling errors), which limits the performance and applicability of these models.
### Solution
To overcome this issue, the authors propose a new framework called **Prompt Augmentation and Caption Utilization (PACU)**. PACU enhances the generative capabilities of VLLMs through the following two aspects:
1. **Prompt Augmentation**:
- An automatic prompt augmentation module uses existing large language models (LLMs) to generate diverse variants of each prompt.
- A prompt evaluation module scores these variants, and only the high-quality prompts are retained for model training.
2. **Caption Utilization**:
- Utilize image captions as prior knowledge to assist in response generation.
- When visual features are inaccurate, LLMs can capture useful information from captions to generate more accurate responses.
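The generate, score, and retain steps of the prompt augmentation stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: PACU uses existing LLMs for both augmentation and evaluation, whereas the rule-based `append_words`, `inject_typo`, and `token_overlap_score` functions and the `threshold` parameter below are hypothetical stand-ins chosen so the example runs without a model.

```python
import random
from typing import Callable, List, Tuple


def append_words(prompt: str) -> str:
    """Word-appending augmentation (hypothetical stand-in for an LLM call)."""
    return prompt + " Please answer briefly."


def inject_typo(prompt: str, rng: random.Random) -> str:
    """Spelling-error augmentation: swap two adjacent characters."""
    if len(prompt) < 2:
        return prompt
    i = rng.randrange(len(prompt) - 1)
    chars = list(prompt)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment(prompt: str, rng: random.Random) -> List[str]:
    """Produce candidate augmented prompts for one original prompt."""
    return [append_words(prompt), inject_typo(prompt, rng)]


def token_overlap_score(original: str, candidate: str) -> float:
    """Jaccard overlap of lowercase tokens.

    Assumption: this replaces the LLM-based prompt evaluator with a
    cheap proxy for "the candidate still asks the same question".
    """
    a = set(original.lower().split())
    b = set(candidate.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_prompts(
    original: str,
    candidates: List[str],
    scorer: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Keep only candidates whose score reaches the threshold."""
    scored = [(c, scorer(original, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]
```

For example, `select_prompts(p, augment(p, random.Random(0)), token_overlap_score, 0.5)` returns the retained variants of `p` together with their scores; a variant that drifts to a different question receives a low overlap score and is dropped.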
### Experimental Validation
The authors extensively validated the effectiveness of PACU through experiments. The results show that PACU not only improves the performance of VLLMs in handling augmented prompts but also enhances the accuracy of the models under the original unaugmented prompt settings. Additionally, PACU can work well with various VLLM frameworks and LLMs, demonstrating generality and flexibility.
### Main Contributions
1. **Introducing Prompt Augmentation to VLLMs**: The authors' extensive tests show that existing VLLMs handle augmented prompts poorly and can hallucinate again once the prompt is modified.
2. **Proposing the PACU Framework**: PACU generates, evaluates, and reweights diverse augmented prompts and adds a caption utilization generation mechanism, which together strengthen the prompt handling capability of VLLMs.
3. **Experimental Validation**: Experimental results on multiple hallucination evaluation datasets demonstrate the effectiveness and advantages of the PACU method.
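The summary does not say how the augmented prompts are reweighted. One plausible reading, shown here purely as an assumption, is that each prompt's evaluation score scales its contribution to the training loss. The function names and the sum-to-one normalization are hypothetical.

```python
from typing import List, Sequence


def normalize_weights(scores: Sequence[float]) -> List[float]:
    """Turn non-negative quality scores into weights that sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return [s / total for s in scores]


def weighted_loss(losses: Sequence[float], scores: Sequence[float]) -> float:
    """Score-weighted average of per-prompt losses.

    Assumption: higher-scored prompts should influence training more.
    """
    if len(losses) != len(scores):
        raise ValueError("losses and scores must have the same length")
    weights = normalize_weights(scores)
    return sum(w * l for w, l in zip(weights, losses))
```

With equal scores this reduces to the plain mean of the losses; raising one prompt's score pulls the result toward that prompt's loss.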
### Related Work
- **Vision Language Large Models (VLLMs)**: Combine large language models and visual modalities to perform various vision-language understanding tasks.
- **Anti-Hallucination Techniques**: Many studies aim to detect and eliminate hallucination phenomena in VLLMs, but these techniques still have shortcomings in handling augmented prompts.
### Methodology
The PACU framework consists of a prompt augmentation module and a caption utilization generation mechanism. The prompt augmentation module generates diverse prompts and passes them through an evaluation module that retains only the high-quality ones. The caption utilization generation mechanism supplies the image caption alongside the image features and the prompt during response generation, so the LLM can fall back on the caption when the visual features are inaccurate.
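A minimal sketch of how a caption might be placed next to the prompt in the text fed to the LLM is given below. The `<image>` placeholder, the `Sample` type, and the template wording are illustrative assumptions; the summary does not specify the input format PACU actually uses.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical marker for where the visual features would be spliced in.
IMAGE_PLACEHOLDER = "<image>"


@dataclass
class Sample:
    prompt: str
    caption: Optional[str] = None


def build_llm_input(sample: Sample) -> str:
    """Assemble the text input: image slot, optional caption, then prompt."""
    parts = [IMAGE_PLACEHOLDER]
    if sample.caption:
        # The caption gives the LLM a textual fallback description
        # in case the visual features are inaccurate.
        parts.append(f"Image caption: {sample.caption.strip()}")
    parts.append(f"Question: {sample.prompt.strip()}")
    parts.append("Answer:")
    return "\n".join(parts)
```

Calling `build_llm_input(Sample("Is there a dog?", "A cat sleeps on a sofa."))` yields a four-line input in which the caption precedes the question; omitting the caption falls back to the plain image-plus-prompt input.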
### Performance Evaluation
Experimental results show that PACU significantly improves the performance of VLLMs on multiple benchmark datasets, especially in handling augmented prompts. Visualization results further demonstrate the advantages of PACU, particularly in dealing with complex and diverse prompts.
### Conclusion
As a plug-in framework, PACU can work in conjunction with existing anti-hallucination solutions, VLLM frameworks, and LLMs, effectively enhancing the prompt handling capability and generation quality of VLLMs.