Abstract: While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations for image captioning, surpassing the popular and similarly-sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa, an approach harnessing advancements in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generations without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics. Code and models can be found at: https://github.com/assafbk/mocha_code

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses hallucinations in image captioning, that is, false details in the generated text that cannot be inferred from the given image. Existing methods mostly rely on a closed-vocabulary object list to detect or mitigate hallucinations in captions, which overlooks the long-tailed nature of the hallucinations that occur in practice. To tackle this, the authors propose a framework for addressing hallucinations in the open-vocabulary setting.

### Main Contributions

1. **OpenCHAIR Benchmark**: A new benchmark for evaluating open-vocabulary object hallucinations. It builds diverse synthetic images and captions with generative foundation models and surpasses the similarly-sized CHAIR benchmark in both diversity and accuracy.
2. **MOCHa Framework**: A reinforcement learning-based approach to mitigating hallucinations in the open-vocabulary setting. Its multi-objective reward function explicitly targets the trade-off between the fidelity and adequacy of the generated text, without requiring strong supervision.
3. **Experimental Results**: Experiments demonstrate the advantages of OpenCHAIR for measuring hallucinations in the open-vocabulary setting and the effectiveness of MOCHa in reducing them.

### Method Overview

- **OpenCHAIR Benchmark**:
  - Uses a large language model (LLM) to generate diverse synthetic captions.
  - Uses a text-to-image generation model to create the corresponding images.
  - Ensures the quality of the generated images and captions through manual filtering.
  - During evaluation, uses an LLM to determine whether a generated caption contains hallucinated objects.
- **MOCHa Framework**:
  - Designs a multi-objective reward function with a fidelity objective, an adequacy objective, and a KL regularization term.
  - Optimizes the captioning model with the Proximal Policy Optimization (PPO) algorithm so that it generates high-quality, factually accurate captions.
  - Balances the fidelity and adequacy of the generated text by adjusting the reward weight parameters.

### Experimental Results

- **Quantitative Results**: On the MS-COCO dataset, the MOCHa-optimized model reduces hallucinations while maintaining or improving standard caption quality metrics (e.g., BLEU-4, CIDEr).
- **Qualitative Results**: The optimized model generates captions that better match the image content while retaining sufficient detail.
- **Comparison with Existing Methods**: MOCHa outperforms existing hallucination-reduction methods on multiple metrics, especially in the open-vocabulary setting.

### Conclusion

By introducing the OpenCHAIR benchmark and the MOCHa framework, the paper addresses hallucinations in image captioning, particularly in the open-vocabulary setting. These contributions improve the quality of generated captions and provide new tools and directions for future research.
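
To make the multi-objective reward from the Method Overview more concrete, the sketch below shows one way a fidelity score, an adequacy score, and a KL penalty could be folded into a single scalar. The summary does not give the exact formula, so the convex combination, the names `RewardWeights`, `alpha`, `beta`, and `combined_reward`, and the default values are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; names and defaults are illustrative assumptions."""
    alpha: float = 0.5  # share of the reward assigned to fidelity, in [0, 1]
    beta: float = 0.1   # strength of the KL penalty, must be >= 0


def combined_reward(
    fidelity: float,
    adequacy: float,
    kl_penalty: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Combine three per-caption scores into one scalar reward.

    Assumed form (not taken from the paper):
        alpha * fidelity + (1 - alpha) * adequacy - beta * kl_penalty
    """
    if not 0.0 <= weights.alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if weights.beta < 0.0:
        raise ValueError("beta must be non-negative")
    # Trade off faithfulness to the image against descriptive coverage.
    quality = weights.alpha * fidelity + (1.0 - weights.alpha) * adequacy
    # Penalize drifting away from the initial captioning model.
    return quality - weights.beta * kl_penalty


if __name__ == "__main__":
    # Made-up scores: a terse, faithful caption vs. a detailed one with a
    # hallucinated object. Raising alpha favors the faithful caption.
    for alpha in (0.2, 0.8):
        w = RewardWeights(alpha=alpha, beta=0.1)
        terse = combined_reward(fidelity=0.9, adequacy=0.4, kl_penalty=0.2, weights=w)
        detailed = combined_reward(fidelity=0.5, adequacy=0.9, kl_penalty=0.2, weights=w)
        print(f"alpha={alpha}: terse={terse:.3f} detailed={detailed:.3f}")
```

In a PPO setup, a scalar like this would be computed per sampled caption and passed to the policy update; moving `alpha` toward 1 pushes the model toward fewer unsupported details, while moving it toward 0 rewards fuller descriptions.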

Mitigating Open-Vocabulary Caption Hallucinations

ALOHa: A New Measure for Hallucination in Captioning Models

Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites

Mitigating Object Hallucination via Concentric Causal Attention

Multi-Modal Hallucination Control by Visual Information Grounding

ESREAL: Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models

Hallucination Augmented Contrastive Learning for Multimodal Large Language Model

Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions

A Unified Hallucination Mitigation Framework for Large Vision-Language Models

Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?

VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

Mitigating Multilingual Hallucination in Large Vision-Language Models

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization

Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

HallE-Control: Controlling Object Hallucination in Large Multimodal Models

Hallucination Mitigation Prompts Long-term Video Understanding

Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models

Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?

HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption