Abstract: In this work, we address the challenge of developing automatic evaluation metrics for image captioning, with a particular focus on robustness against hallucinations. Existing metrics are often inadequate for handling hallucinations, primarily due to their limited ability to compare candidate captions with multifaceted reference captions. To address this shortcoming, we propose DENEB, a novel supervised automatic evaluation metric specifically robust against hallucinations. DENEB incorporates the Sim-Vec Transformer, a mechanism that processes multiple references simultaneously, thereby efficiently capturing the similarity between an image, a candidate caption, and reference captions. To train DENEB, we construct the diverse and balanced Nebula dataset comprising 32,978 images, paired with human judgments provided by 805 annotators. We demonstrated that DENEB achieves state-of-the-art performance among existing LLM-free metrics on the FOIL, Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and PASCAL-50S datasets, validating its effectiveness and robustness against hallucinations.

What problem does this paper attempt to address?

This paper addresses a key problem in image captioning: existing automatic evaluation metrics perform poorly on captions that contain hallucinations. A hallucination is content in a generated caption that is not present in the image. This is especially problematic in application scenarios requiring high reliability, such as helping the visually impaired, medical image analysis, and robot interpretation generation.

Specifically, the paper points out the following deficiencies in current evaluation metrics:

1. **Insufficient robustness to hallucinations**: Existing metrics have limited ability to compare a candidate caption with multifaceted reference captions, making it difficult for them to distinguish correct captions from captions containing hallucinations.
2. **Dependence on a single reference**: Most existing metrics process each reference caption independently and fail to fully utilize the rich information provided by multiple reference captions.
3. **Low correlation with human judgment**: Although some data-driven metrics correlate relatively well with human judgment, they still handle hallucinations inadequately.

To solve these problems, the authors propose a new supervised automatic evaluation metric, DENEB, which has the following characteristics:

- **Sim-Vec Transformer**: A new architecture that processes multiple reference captions simultaneously and captures the similarities between the image, the candidate caption, and the reference captions.
- **Nebula dataset**: To train DENEB, the authors constructed a diverse and balanced dataset, Nebula, which contains 32,978 images and human judgments provided by 805 annotators.
- **Improved feature extraction method**: Similarity features are computed from CLIP and RoBERTa embedding vectors by combining the Hadamard product with element-wise differences.
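The feature extraction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sim_vec_features` is hypothetical, plain arrays stand in for CLIP and RoBERTa embeddings, all embeddings are assumed to share one dimension, and taking the absolute value of the difference is an assumption made for illustration.

```python
import numpy as np


def sim_vec_features(candidate, references, image):
    """Build one similarity vector per comparison target (hypothetical helper).

    candidate:  (d,) embedding of the candidate caption
    references: (n_refs, d) embeddings of the reference captions
    image:      (d,) embedding of the image
    Returns an array of shape (n_refs + 1, 2 * d).
    """
    candidate = np.asarray(candidate, dtype=float)
    references = np.asarray(references, dtype=float)
    image = np.asarray(image, dtype=float)

    # Stack all references and the image so they are handled in one pass.
    targets = np.vstack([references, image[None, :]])

    # Hadamard (element-wise) product of the candidate with each target.
    hadamard = targets * candidate

    # Element-wise difference; the absolute value is an assumption here.
    difference = np.abs(targets - candidate)

    # Concatenate both feature types along the feature axis.
    return np.concatenate([hadamard, difference], axis=1)


if __name__ == "__main__":
    feats = sim_vec_features(
        candidate=[1.0, 2.0],
        references=[[1.0, 0.0], [3.0, 2.0]],
        image=[0.0, 2.0],
    )
    print(feats.shape)  # (3, 4): two references plus the image, 2 * d features
```

In a setup like this, the stacked rows would be the sequence a model consumes, which is one way to let all references be considered simultaneously rather than one at a time.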
Experimental results show that DENEB achieves state-of-the-art performance among existing LLM-free metrics on multiple benchmark datasets (FOIL, Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and PASCAL-50S), and is notably robust against hallucinations. In summary, this paper develops a new evaluation metric, DENEB, that is more robust to hallucinations in the image captioning task, so that the quality of generated captions can be assessed more reliably.

DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning

Learning to Evaluate Image Captioning

ALOHa: A New Measure for Hallucination in Captioning Models

Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?

Neuraltalk+: neural image captioning with visual assistance capabilities

Mitigating Open-Vocabulary Caption Hallucinations

BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models

Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent

HallE-Control: Controlling Object Hallucination in Large Multimodal Models

Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions

Quality-agnostic Image Captioning to Safely Assist People with Vision Impairment

HallE-Switch: Rethinking and Controlling Object Existence Hallucinations in Large Vision Language Models for Detailed Caption

See or Guess: Counterfactually Regularized Image Captioning

Mitigating Fine-Grained Hallucination by Fine-Tuning Large Vision-Language Models with Caption Rewrites

VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics

Neural Image Caption Generation with Weighted Training and Reference

From Captions to Visual Concepts and Back

FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs