Abstract: Large Vision-Language Models (LVLMs) are increasingly adept at generating contextually detailed and coherent responses from visual inputs. However, their application in multimodal decision-making and open-ended generation is hindered by a notable rate of hallucinations, where generated text inaccurately represents the visual contents. To address this issue, this paper introduces the Instruction Contrastive Decoding (ICD) method, a novel approach designed to reduce hallucinations during LVLM inference. Our method is inspired by our observation that what we call disturbance instructions significantly exacerbate hallucinations in multimodal fusion modules. ICD contrasts distributions from standard and instruction disturbance, thereby increasing alignment uncertainty and effectively subtracting hallucinated concepts from the original distribution. Through comprehensive experiments on discriminative benchmarks (POPE and MME) and a generative benchmark (LLaVa-Bench), we demonstrate that ICD significantly mitigates both object-level and attribute-level hallucinations. Moreover, our method not only addresses hallucinations but also significantly enhances the general perception and recognition capabilities of LVLMs.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Multimedia
What problem does this paper attempt to address?
This paper addresses hallucination in large vision-language models (LVLMs) during multimodal decision-making and open-ended generation. Although LVLMs can generate contextually detailed and coherent responses from visual inputs, the text they produce often misrepresents the visual content, leading to object-level and attribute-level hallucinations. The paper introduces a new method, Instruction Contrastive Decoding (ICD), that aims to reduce hallucinations during LVLM inference. ICD contrasts the output distribution obtained under the standard instruction with the one obtained under a disturbed instruction. Because the disturbed instruction increases alignment uncertainty and amplifies hallucinated concepts, contrasting the two distributions effectively subtracts those concepts from the original distribution.
### Main contributions of the paper:
1. **In-depth analysis of how disturbed instructions exacerbate hallucinations**: The paper examines the root causes of hallucinations from the perspectives of statistical bias and language priors.
2. **Introduction of the ICD method**: A novel strategy that first amplifies hallucinations with a disturbed instruction and then removes them, adjusting the output distribution at inference time.
3. **Verification of the effectiveness of the ICD method through extensive experiments**: The paper shows that ICD significantly reduces hallucinations on multiple benchmarks and consistently improves performance on 14 general perception and recognition tasks.
### Method overview:
- **LVLM inference process**: LVLMs consist of a visual encoder, a fusion module, and a language model. The visual encoder extracts visual features, the fusion module achieves multimodal alignment, and the language model generates text responses.
- **Instruction disturbance**: A role prefix (positive or negative) is added in front of the original instruction. This increases the uncertainty of multimodal alignment and thereby exacerbates hallucinations.
- **Instruction Contrastive Decoding (ICD)**: The probability distributions under the standard instruction and the disturbed instruction are contrasted, and tokens are selected that have high probability under the standard instruction and low probability under the disturbed one, thereby reducing hallucinations. ICD also applies an adaptive plausibility constraint to further refine the contrastive distribution.
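
The contrast step described above can be sketched for a single decoding step as follows. This is a minimal illustration, not the paper's implementation: the function names, the prefix string, the weighting `(1 + lam) * standard - lam * disturbed`, and the cutoff rule `beta * max probability` are all assumptions borrowed from the general contrastive-decoding literature, and the logits are made-up toy values rather than outputs of a real LVLM.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def disturb(instruction, prefix="You are a confused object detector."):
    """Prepend a role prefix to the instruction (hypothetical prefix text)."""
    return f"{prefix} {instruction}"


def icd_scores(std_logits, dist_logits, lam=1.0, beta=0.1):
    """Contrast standard and disturbed next-token logits for one step.

    Assumed form: score = (1 + lam) * log p_std - lam * log p_dist,
    restricted to tokens with p_std >= beta * max(p_std); all other
    tokens are masked out with -inf.
    """
    std_lp = log_softmax(std_logits)
    dist_lp = log_softmax(dist_logits)
    cutoff = max(std_lp) + math.log(beta)  # log(beta * max p_std)
    scores = []
    for s, d in zip(std_lp, dist_lp):
        if s >= cutoff:
            scores.append((1.0 + lam) * s - lam * d)
        else:
            scores.append(float("-inf"))
    return scores


def pick_token(scores):
    """Greedy choice: index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary and logits (illustrative values only).
vocab = ["dog", "cat", "frisbee", "zebra"]
std_logits = [2.0, 1.8, 0.0, -6.0]    # standard instruction
dist_logits = [3.0, 1.0, 0.0, -20.0]  # disturbed instruction boosts "dog"

greedy = pick_token(std_logits)
contrasted = pick_token(icd_scores(std_logits, dist_logits))
print(vocab[greedy], "->", vocab[contrasted])
```

In this toy case, plain greedy decoding picks the token that the disturbed instruction inflates, while the contrasted scores prefer the runner-up. The cutoff matters because, without it, a token that is nearly impossible under both instructions can receive a large contrastive score simply because the disturbed distribution rates it even lower.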
### Experimental results:
- **POPE benchmark**: ICD significantly improves accuracy, precision, recall, and F1-score on three subsets: MSCOCO, A-OKVQA, and GQA.
- **MME benchmark**: ICD performs strongly on object-level and attribute-level hallucination tasks, significantly outperforming the baseline model and the VCD method.
- **LLaVa-Bench benchmark**: Case studies qualitatively evaluate the effectiveness of ICD in open-ended generation.
In conclusion, this paper proposes an effective method for reducing hallucinations in LVLMs, which improves the accuracy of the model and also enhances its overall performance on multimodal tasks.