Contrastive Semantic Similarity Learning for Image Captioning Evaluation

Chao Zeng, Sam Kwong, Tiesong Zhao, Hanli Wang
DOI: https://doi.org/10.1016/j.ins.2022.07.142
IF: 8.1
2022-01-01
Information Sciences
Abstract: Automatically evaluating the quality of image captions is challenging because human language is flexible enough that the same meaning can be expressed in many ways. Most current captioning metrics rely on token-level matching between the candidate caption and the ground-truth label sentences, and they usually neglect sentence-level information. Motivated by the auto-encoder mechanism and advances in contrastive representation learning, we propose a learning-based metric, I2CE (Intrinsic Image Captioning Evaluation). To learn the evaluation metric, we develop three progressive model structures that capture sentence-level representations: a single-branch model, a dual-branch model, and a triple-branch model. To evaluate the proposed metric, we select one automatic captioning model and collect human scores on the quality of its generated captions. We introduce a statistical test on the correlation between human scores and metric scores. I2CE achieves a Spearman correlation of 51.42, compared with 41.95 for a recently proposed BERT-based metric, and it also outperforms the conventional rule-based metrics. Extensive results on the Composite-coco and PASCAL-50S datasets further validate the effectiveness of the proposed metric. The metric could serve as a novel indicator of the intrinsic information between captions, complementing existing ones.
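
The correlation check mentioned in the abstract, comparing human scores with metric scores via Spearman correlation, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names (average_ranks, pearson, spearman) and the toy scores are hypothetical, and the sketch returns a value in [-1, 1], so treating the reported 51.42 and 41.95 as the same quantity multiplied by 100 is an assumption.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(human: Sequence[float], metric: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(human) != len(metric) or len(human) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(human), average_ranks(metric))


# Toy data (made up for illustration): five captions rated by humans
# and scored by some automatic metric.
human_scores = [1, 2, 3, 4, 5]
metric_scores = [0.2, 0.1, 0.4, 0.3, 0.5]
rho = spearman(human_scores, metric_scores)
```

Because Spearman correlation depends only on rank order, a metric is rewarded for ordering captions the way human raters do, regardless of the absolute scale of its scores.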