HICEScore: A Hierarchical Metric for Image Captioning Evaluation

Zequn Zeng, Jianqiao Sun, Hao Zhang, Tiansheng Wen, Yudi Su, Yan Xie, Zhengjue Wang, Bo Chen
DOI: https://doi.org/10.1145/3664647.3681358
2024-07-26
Abstract: Image captioning evaluation metrics can be divided into two categories, reference-based metrics and reference-free metrics. However, reference-based approaches may struggle to evaluate descriptive captions with abundant visual details produced by advanced multimodal large language models, due to their heavy reliance on limited human-annotated references. In contrast, previous reference-free metrics have been proven effective via CLIP cross-modality similarity. Nonetheless, CLIP-based metrics, constrained by their solution of global image-text compatibility, often have a deficiency in detecting local textual hallucinations and are insensitive to small visual objects. Besides, their single-scale designs are unable to provide an interpretable evaluation process such as pinpointing the position of caption mistakes and identifying visual regions that have not been described. To move forward, we propose a novel reference-free metric for image captioning evaluation, dubbed Hierarchical Image Captioning Evaluation Score (HICE-S). By detecting local visual regions and textual phrases, HICE-S builds an interpretable hierarchical scoring mechanism, breaking through the barriers of the single-scale structure of existing reference-free metrics. Comprehensive experiments indicate that our proposed metric achieves the SOTA performance on several benchmarks, outperforming existing reference-free metrics like CLIP-S and PAC-S, and reference-based metrics like METEOR and CIDEr. Moreover, several case studies reveal that the assessment process of HICE-S on detailed captions closely resembles interpretable human judgments. Our code is available at https://github.com/joeyz0z/HICE.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses three limitations of existing image captioning evaluation metrics:

1. **Reference dependence**: Reference-based metrics (such as BLEU, METEOR, and CIDEr) rely on a limited number of human-annotated references. These references fall short when evaluating the detailed descriptive captions generated by advanced multimodal large language models: such captions usually contain rich visual details that the references may not cover, which leads to inaccurate evaluation.
2. **Insufficient detection of local details**: Existing reference-free metrics (such as CLIP-S and PAC-S) are effective at measuring global image-text compatibility, but they are poor at detecting local textual hallucinations (incorrect descriptions in a caption) and are insensitive to small visual objects. These methods focus on global similarity and ignore the local relationships between image regions and text phrases.
3. **Limitations of single-scale design**: Existing reference-free metrics use a single-scale design and cannot provide an interpretable evaluation process. For example, they cannot pinpoint where a caption is wrong or identify visual regions that have not been described.

To address these problems, the authors propose a new reference-free metric, the **Hierarchical Image Captioning Evaluation Score (HICE-S)**. HICE-S improves on existing methods in the following ways:

- **Hierarchical scoring mechanism**: HICE-S considers global image-text compatibility (gITC) and also introduces local image-text compatibility (lITC) to capture the fine-grained relationships between images and captions.
- **Interpretability**: By detecting local visual regions and textual phrases, HICE-S provides an interpretable scoring process that helps locate errors in captions and identify undescribed visual regions.
- **Performance improvement**: Experiments show that HICE-S achieves state-of-the-art results on several benchmarks, outperforming reference-free metrics such as CLIP-S and PAC-S as well as reference-based metrics such as METEOR and CIDEr, and case studies indicate that its assessment of detailed captions closely resembles human judgment.

In summary, the paper aims to overcome the limitations of existing image captioning evaluation metrics by proposing HICE-S, so that the detailed descriptive captions generated by multimodal large language models can be evaluated more accurately.
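
The idea of combining a global score with a local, region-to-phrase score can be illustrated with a small sketch. This is not the authors' implementation and does not reproduce the HICE-S formula, which the text above does not give. It assumes that image, caption, region, and phrase embeddings have already been produced by some encoder and are nonzero vectors. The function names, the F-score style combination of precision and recall, the equal weighting of global and local terms, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def _normalize(x):
    """L2-normalize vectors along the last axis (assumes nonzero vectors)."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def global_compat(image_emb, caption_emb):
    """Global image-text compatibility: clipped cosine similarity."""
    sim = float(_normalize(image_emb) @ _normalize(caption_emb))
    return max(sim, 0.0)


def local_compat(region_embs, phrase_embs):
    """Local compatibility from a region-by-phrase similarity matrix.

    Hypothetical design: precision averages each phrase's best-matching
    region, recall averages each region's best-matching phrase, and the
    two are combined with a harmonic mean.
    """
    sims = _normalize(region_embs) @ _normalize(phrase_embs).T  # shape (R, P)
    sims = np.clip(sims, 0.0, None)
    precision = sims.max(axis=0).mean()  # is every phrase grounded in a region?
    recall = sims.max(axis=1).mean()     # is every region mentioned by a phrase?
    if precision + recall == 0.0:
        return 0.0, sims
    return float(2 * precision * recall / (precision + recall)), sims


def hierarchical_score(image_emb, caption_emb, region_embs, phrase_embs):
    """Equal-weight average of global and local scores (an assumption)."""
    g = global_compat(image_emb, caption_emb)
    l, _ = local_compat(region_embs, phrase_embs)
    return 0.5 * (g + l)


def explain(sims, threshold=0.5):
    """Return indices of poorly grounded phrases and undescribed regions."""
    hallucinated = [j for j in range(sims.shape[1]) if sims[:, j].max() < threshold]
    undescribed = [i for i in range(sims.shape[0]) if sims[i, :].max() < threshold]
    return hallucinated, undescribed


# Toy example: two image regions, one caption phrase that matches region 0 only.
image_emb, caption_emb = [1.0, 0.0], [1.0, 0.0]
region_embs = [[1.0, 0.0], [0.0, 1.0]]
phrase_embs = [[1.0, 0.0]]

score = hierarchical_score(image_emb, caption_emb, region_embs, phrase_embs)
_, sims = local_compat(region_embs, phrase_embs)
hallucinated, undescribed = explain(sims)
```

In the toy example the single phrase is fully grounded, so no phrase is flagged as a hallucination, while region 1 has no matching phrase and is reported as undescribed. This is the kind of interpretable output that a single global similarity cannot provide.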