DENEB: A Hallucination-Robust Automatic Evaluation Metric for Image Captioning

Kazuki Matsuda, Yuiga Wada, Komei Sugiura
2024-10-24
Abstract: In this work, we address the challenge of developing automatic evaluation metrics for image captioning, with a particular focus on robustness against hallucinations. Existing metrics are often inadequate for handling hallucinations, primarily due to their limited ability to compare candidate captions with multifaceted reference captions. To address this shortcoming, we propose DENEB, a novel supervised automatic evaluation metric specifically robust against hallucinations. DENEB incorporates the Sim-Vec Transformer, a mechanism that processes multiple references simultaneously, thereby efficiently capturing the similarity between an image, a candidate caption, and reference captions. To train DENEB, we construct the diverse and balanced Nebula dataset comprising 32,978 images, paired with human judgments provided by 805 annotators. We demonstrated that DENEB achieves state-of-the-art performance among existing LLM-free metrics on the FOIL, Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and PASCAL-50S datasets, validating its effectiveness and robustness against hallucinations.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses a key problem in image captioning: existing automatic evaluation metrics perform poorly on captions that contain hallucinations. A hallucination is content in a generated caption that is not present in the image. This is especially problematic in applications that require high reliability, such as assisting the visually impaired, medical image analysis, and robot interpretation generation. Specifically, the paper identifies the following deficiencies in current evaluation metrics:

1. **Insufficient robustness to hallucinations**: Existing metrics have limited ability to compare a candidate against multifaceted reference captions, which makes it hard for them to distinguish correct captions from captions containing hallucinations.
2. **Dependence on a single reference**: Most existing metrics process each reference caption independently and fail to fully exploit the rich information provided by multiple reference captions.
3. **Low correlation with human judgment**: Although some data-driven metrics correlate relatively well with human judgment, they remain inadequate at handling hallucinations.

To solve these problems, the authors propose a new supervised automatic evaluation metric, DENEB, which has the following characteristics:

- **Sim-Vec Transformer**: A new architecture that processes multiple reference captions simultaneously and captures the similarities between the image, the candidate caption, and the reference captions.
- **Nebula dataset**: To train DENEB, the authors constructed a diverse and balanced dataset, Nebula, which contains 32,978 images with human judgments provided by 805 annotators.
- **Improved feature extraction method**: Features are computed from CLIP and RoBERTa embeddings by combining the Hadamard product with element-wise differences, to better capture similarities.
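The feature extraction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `sim_vec` and `build_feature_sequence` are hypothetical, the use of the absolute value for the element-wise difference is an assumption, and random toy vectors stand in for real CLIP and RoBERTa embeddings.

```python
import numpy as np


def sim_vec(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Combine the Hadamard product and the element-wise difference of two embeddings.

    Assumption: the difference is taken in absolute value; the summary only
    says "element-wise differences".
    """
    a, b = np.broadcast_arrays(np.asarray(a, dtype=np.float64),
                               np.asarray(b, dtype=np.float64))
    hadamard = a * b                 # element-wise (Hadamard) product
    difference = np.abs(a - b)       # element-wise absolute difference
    return np.concatenate([hadamard, difference], axis=-1)


def build_feature_sequence(candidate: np.ndarray,
                           image: np.ndarray,
                           references: np.ndarray) -> np.ndarray:
    """Build one feature row for (candidate, image) and one per reference.

    candidate:  shape (d,)   embedding of the candidate caption
    image:      shape (d,)   embedding of the image
    references: shape (n, d) embeddings of the n reference captions
    Returns an array of shape (1 + n, 2 * d), so all references are
    represented together rather than scored one at a time.
    """
    image_row = sim_vec(candidate, image)[None, :]   # (1, 2d)
    reference_rows = sim_vec(candidate, references)  # (n, 2d) via broadcasting
    return np.concatenate([image_row, reference_rows], axis=0)


# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
d, n = 4, 3
features = build_feature_sequence(rng.normal(size=d),
                                  rng.normal(size=d),
                                  rng.normal(size=(n, d)))
print(features.shape)  # (4, 8)
```

In a metric of this kind, a sequence like `features` would then be passed to a learned model (here, the Sim-Vec Transformer) that regresses a score against human judgments. That part is omitted because the summary does not specify it.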
Experimental results show that DENEB achieves state-of-the-art performance among LLM-free metrics on multiple benchmark datasets, namely FOIL, Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and PASCAL-50S, and is particularly effective at handling hallucinations. In summary, the paper develops a new evaluation metric, DENEB, that is more robust to hallucinations in image captioning, so that generated captions can be evaluated more reliably and accurately.