Mitigating Open-Vocabulary Caption Hallucinations

Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, Hadar Averbuch-Elor
2024-10-17
Abstract: While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations for image captioning, surpassing the popular and similarly-sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa, an approach harnessing advancements in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generations without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics. Code and models can be found at: https://github.com/assafbk/mocha_code
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses hallucinations in image captioning: false details in the generated text that cannot be inferred from the given image. Existing methods mostly rely on closed-vocabulary object lists to detect or mitigate hallucinations in image captions, which overlooks the long-tailed nature of the hallucinations that occur in practice. To tackle this, the authors propose a framework for addressing hallucinations in the open-vocabulary setting.

### Main Contributions

1. **OpenCHAIR Benchmark**: A new benchmark for evaluating open-vocabulary object hallucinations, built from diverse synthetic images and captions produced by generative foundation models. It surpasses the similarly sized CHAIR benchmark in both diversity and accuracy.
2. **MOCHa Framework**: A reinforcement learning-based approach to mitigating hallucinations in the open-vocabulary setting. Its multi-objective reward function explicitly targets the trade-off between the fidelity and adequacy of the generated text, without requiring strong supervision.
3. **Experimental Results**: Experiments demonstrate the advantages of OpenCHAIR for measuring hallucinations in an open setting and the effectiveness of MOCHa in reducing them.

### Method Overview

- **OpenCHAIR Benchmark**:
  - Uses a large language model (LLM) to generate diverse synthetic captions.
  - Uses a text-to-image generation model to create the corresponding images.
  - Ensures the quality of the generated images and captions through manual filtering.
  - During evaluation, uses an LLM to determine whether a generated caption contains hallucinated objects.
- **MOCHa Framework**:
  - Designs a multi-objective reward function that combines a fidelity objective, an adequacy objective, and a KL regularization term.
  - Optimizes the captioning model with Proximal Policy Optimization (PPO) so that it generates high-quality, factually accurate captions.
  - Balances the fidelity and adequacy of the generated text by adjusting the reward weight parameters.

### Experimental Results

- **Quantitative Results**: On the MS-COCO dataset, MOCHa-optimized models hallucinate less while maintaining or improving standard caption quality metrics (e.g., BLEU-4, CIDEr).
- **Qualitative Results**: The optimized models generate captions that better match the image content while retaining sufficient detail.
- **Comparison with Existing Methods**: MOCHa outperforms existing hallucination-reduction methods on multiple metrics, especially in the open-vocabulary setting.

### Conclusion

By introducing the OpenCHAIR benchmark and the MOCHa framework, the paper tackles hallucinations in image captioning without depending on a closed object list. MOCHa improves the quality of generated captions, and OpenCHAIR gives future research a tool for measuring open-vocabulary hallucinations.
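
To make the reward design described under Method Overview concrete, here is a minimal sketch of how a fidelity score, an adequacy score, and a KL regularization term could be combined into one scalar reward. This is an illustration under stated assumptions and not the authors' implementation: the convex weighting with `alpha`, the coefficient `beta`, the default values, and the function names are all hypothetical, and the summary does not say how the fidelity and adequacy scores themselves are computed, so they are treated as given numbers.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; names and defaults are illustrative only."""
    alpha: float = 0.5  # share of the reward given to fidelity (the rest goes to adequacy)
    beta: float = 0.1   # strength of the KL regularization term


def kl_estimate(policy_logprobs: Sequence[float],
                reference_logprobs: Sequence[float]) -> float:
    """Per-caption KL estimate: the summed log-probability gap between the
    fine-tuned policy and the frozen reference model over the sampled tokens."""
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability sequences must have equal length")
    return sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))


def combined_reward(fidelity: float,
                    adequacy: float,
                    kl: float,
                    weights: RewardWeights = RewardWeights()) -> float:
    """Assumed form: alpha * fidelity + (1 - alpha) * adequacy - beta * kl."""
    if not 0.0 <= weights.alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if weights.beta < 0.0:
        raise ValueError("beta must be non-negative")
    task_reward = weights.alpha * fidelity + (1.0 - weights.alpha) * adequacy
    return task_reward - weights.beta * kl


if __name__ == "__main__":
    # Made-up scores for a caption that is faithful but not very detailed.
    kl = kl_estimate([-1.0, -2.0], [-1.5, -2.5])
    for alpha in (0.2, 0.8):
        reward = combined_reward(fidelity=0.9, adequacy=0.4, kl=kl,
                                 weights=RewardWeights(alpha=alpha))
        print(f"alpha={alpha}: reward={reward:.3f}")
```

In this sketch, raising `alpha` rewards captions that avoid unsupported details even if they say less, while lowering it favors more complete captions. The `beta` term penalizes drifting away from the reference model during PPO updates.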