Abstract: Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well-known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence. To address these two challenges, we propose a novel learning-based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions. In addition, we further propose a data augmentation scheme to explicitly incorporate pathological transformations as negative examples during training. The proposed metric is evaluated with three kinds of robustness tests and its correlation with human judgments. Extensive experiments show that the proposed data augmentation scheme not only makes our metric more robust toward several pathological transformations, but also improves its correlation with human judgments. Our metric outperforms other metrics on both caption-level human correlation in Flickr 8k and system-level human correlation in COCO. The proposed approach could serve as a learning-based evaluation metric that is complementary to existing rule-based metrics.

What problem does this paper attempt to address?

This paper addresses two main challenges in image captioning evaluation:

1. **Poor correlation with human judgment**: Commonly used evaluation metrics such as CIDEr, METEOR, ROUGE, and BLEU often correlate poorly with human judgment. These word-overlap-based metrics have difficulty capturing the semantic content of a sentence, so their scores can differ substantially from how humans rate caption quality.
2. **Blind spots for pathological constructions**: Each evaluation metric has known blind spots, and rule-based metrics in particular are difficult to repair once a blind spot has been identified. For example, the newly proposed SPICE correlates better with human judgment but ignores the syntactic structure of sentences. SPICE also tends to give high scores to long sentences with repeated clauses.

To address these two challenges, the authors propose a new learning-based discriminative evaluation metric that is directly trained to distinguish between human-written and machine-generated captions. They also propose a data augmentation scheme that explicitly incorporates pathological transformations as negative samples during training, which makes the metric more robust and improves its correlation with human judgment.

Specifically, the method includes the following key steps:

- **Model architecture**: A convolutional neural network (CNN) extracts image features, a long short-term memory network (LSTM) encodes captions, and a binary classifier performs the discrimination. The model takes an image and a candidate caption as input and outputs the probability that the caption was written by a human.
- **Data augmentation**: Three transformations (random captioning, word permutation, random word replacement) generate a large number of pathological samples, which are used as negative samples to improve the robustness of the model.
- **Performance evaluation**: The effectiveness and robustness of the proposed method are verified through multiple tests. In particular, experimental results on the COCO and Flickr 8k datasets show that the method outperforms existing image caption evaluation metrics in both correlation with human judgment and robustness to pathological transformations.

In summary, this paper aims to improve the evaluation of image caption generation with a learning-based method that both correlates better with human judgment and can be adapted to handle various pathological constructions.
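
The three pathological transformations named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `caption_pool` and `vocabulary` arguments, and the 30% replacement rate are all assumptions made for illustration, since the summary does not specify these details.

```python
import random


def random_caption(caption, caption_pool, rng):
    """Random captioning: swap in a caption that belongs to another image.

    Assumes caption_pool (hypothetical) holds at least one other caption.
    """
    others = [c for c in caption_pool if c != caption]
    return rng.choice(others)


def word_permutation(caption, rng):
    """Word permutation: keep the same words but destroy their order."""
    words = caption.split()
    if len(set(words)) < 2:
        return caption  # no distinct ordering exists
    shuffled = words[:]
    while shuffled == words:
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def random_word_replacement(caption, vocabulary, rng, rate=0.3):
    """Random word replacement: overwrite some words with vocabulary words.

    The rate of 0.3 is an illustrative assumption, not a value from the paper.
    """
    words = caption.split()
    if not words:
        return caption
    n = min(len(words), max(1, round(rate * len(words))))
    for i in rng.sample(range(len(words)), n):
        candidates = [w for w in vocabulary if w != words[i]]
        words[i] = rng.choice(candidates)
    return " ".join(words)


def make_negatives(caption, caption_pool, vocabulary, seed=0):
    """Build one negative example per transformation for a human caption."""
    rng = random.Random(seed)
    return {
        "random_caption": random_caption(caption, caption_pool, rng),
        "word_permutation": word_permutation(caption, rng),
        "random_word_replacement": random_word_replacement(caption, vocabulary, rng),
    }
```

In a training loop of the kind the summary describes, each human caption paired with its image would be a positive example, and the outputs of `make_negatives` paired with the same image would be added as negatives alongside machine-generated captions.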

Learning to Evaluate Image Captioning

Contrastive Semantic Similarity Learning for Image Captioning Evaluation

G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o

HICEScore: A Hierarchical Metric for Image Captioning Evaluation

A Novel Evaluation Framework for Image2Text Generation

Cobra Effect in Reference-Free Image Captioning Metrics

FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model

Adversarial Learning-Based Automatic Evaluator for Image Captioning

TIGEr: Text-to-Image Grounding for Image Caption Evaluation

Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method

Towards Annotation-Free Evaluation of Cross-Lingual Image Captioning

Towards Unique and Informative Captioning of Images

InfoMetIC: an Informative Metric for Reference-free Image Caption Evaluation

Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models

Feedback Evaluations to Promote Image Captioning

Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction

CLAIR: Evaluating Image Captions with Large Language Models

Knowing What to Learn: A Metric-Oriented Focal Mechanism for Image Captioning

Enhance Training Objectives for Image Captioning with Decomposed Sequence-level Metric

SPICE: Semantic Propositional Image Caption Evaluation

Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation