Learning to Evaluate Image Captioning

Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, Serge Belongie
DOI: https://doi.org/10.48550/arXiv.1806.06422
2018-06-18
Abstract: Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well-known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence. To address these two challenges, we propose a novel learning-based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions. In addition, we further propose a data augmentation scheme to explicitly incorporate pathological transformations as negative examples during training. The proposed metric is evaluated with three kinds of robustness tests and its correlation with human judgments. Extensive experiments show that the proposed data augmentation scheme not only makes our metric more robust toward several pathological transformations, but also improves its correlation with human judgments. Our metric outperforms other metrics on both caption-level human correlation in Flickr 8k and system-level human correlation in COCO. The proposed approach could serve as a learning-based evaluation metric that is complementary to existing rule-based metrics.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper attempts to address two main challenges in image captioning evaluation:

1. **Poor correlation with human judgment**: Commonly used evaluation metrics such as CIDEr, METEOR, ROUGE, and BLEU often correlate poorly with human judgment. These word-overlap-based metrics have difficulty capturing the semantic content of sentences, so their scores can diverge substantially from human assessments of caption quality.
2. **Blind spots for pathological constructions**: Each evaluation metric has known blind spots, and rule-based metrics offer no mechanism for repairing them once they are identified. For example, the newly proposed SPICE correlates better with human judgment but ignores the syntactic structure of sentences. SPICE also tends to give high scores to long sentences with repeated clauses, another blind spot of this kind.

To address these two challenges, the authors propose a new learning-based discriminative evaluation metric that is trained directly to distinguish between human-written and machine-generated captions. They also propose a data augmentation scheme that explicitly adds pathological transformations to the training set as negative samples, which makes the model more robust and improves its correlation with human judgment. Specifically, the method includes the following key steps:

- **Model architecture**: A convolutional neural network (CNN) extracts image features, a long short-term memory network (LSTM) encodes the caption, and a binary classifier performs the discrimination. The model takes an image and a candidate caption as input and outputs the probability that the caption was written by a human.
- **Data augmentation**: Three transformations (random captions, word permutation, and random word replacement) generate a large number of pathological captions, which are used as negative samples to improve the robustness of the model.
- **Performance evaluation**: The effectiveness and robustness of the proposed method are verified through multiple tests. In particular, experimental results on the COCO and Flickr 8k datasets show that the method outperforms existing image caption evaluation metrics in both correlation with human judgment and robustness to pathological transformations.

In summary, this paper aims to improve the evaluation of image captioning with a learning-based metric that correlates better with human judgment and can be adapted to handle various pathological constructions.
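The three pathological transformations named above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `rate` parameter, and the whitespace tokenization are assumptions made for illustration, and the sketch leaves out the machine-generated captions that the metric is also trained to reject. It only shows how one human caption could yield one positive and three negative training examples.

```python
import random


def random_caption(other_captions, rng):
    """Return a caption written for a different image (hypothetical helper)."""
    return rng.choice(other_captions)


def word_permutation(caption, rng):
    """Shuffle the word order of a caption, keeping the same bag of words."""
    words = caption.split()
    # A caption with fewer than two distinct words cannot be reordered.
    if len(set(words)) < 2:
        return caption
    shuffled = words[:]
    while shuffled == words:
        rng.shuffle(shuffled)
    return " ".join(shuffled)


def random_word_replacement(caption, vocabulary, rng, rate=0.5):
    """Overwrite a fraction of the words with words drawn from a vocabulary.

    The 0.5 rate is an assumption, not a value taken from the paper.
    A drawn word may coincide with the word it replaces.
    """
    words = caption.split()
    if not words:
        return caption
    count = max(1, int(len(words) * rate))
    for position in rng.sample(range(len(words)), count):
        words[position] = rng.choice(vocabulary)
    return " ".join(words)


def build_training_examples(image_id, human_caption, other_captions, vocabulary, rng):
    """Return (image_id, caption, label) triples: 1 = human, 0 = pathological."""
    return [
        (image_id, human_caption, 1),
        (image_id, random_caption(other_captions, rng), 0),
        (image_id, word_permutation(human_caption, rng), 0),
        (image_id, random_word_replacement(human_caption, vocabulary, rng), 0),
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    examples = build_training_examples(
        "img1",
        "a dog runs on the beach",
        ["two cats sleep on a sofa"],
        ["tree", "car", "blue"],
        rng,
    )
    for image_id, caption, label in examples:
        print(label, caption)
```

Passing a seeded `random.Random` instance keeps the generated negatives reproducible, which helps when comparing a metric trained with and without the augmented examples.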