Polos: Multimodal Metric Learning from Human Feedback for Image Captioning

Yuiga Wada, Kanta Kaneda, Daichi Saito, Komei Sugiura
2024-02-28
Abstract: Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models. Recent data-driven metrics have demonstrated a stronger correlation with human judgments than classic metrics such as CIDEr; however, they lack sufficient capabilities to handle hallucinations and generalize across diverse images and texts, partially because they compute scalar similarities merely using embeddings learned from tasks unrelated to image captioning evaluation. In this study, we propose Polos, a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs, using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos, we introduce Multimodal Metric Learning from Human Feedback (M$^2$LHF), a framework for developing metrics based on human feedback. We constructed the Polaris dataset, which comprises 131K human judgments from 550 evaluators and is approximately ten times larger than standard datasets. Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset, thereby demonstrating its effectiveness and robustness.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the gap between automatic evaluation metrics and human judgments of image captioning models. Existing data-driven metrics correlate with human judgments more strongly than classic metrics such as CIDEr, but they handle hallucinations poorly and do not generalize well across diverse images and texts, partly because they compute scalar similarities from embeddings learned on tasks unrelated to caption evaluation. The authors propose Polos, a supervised automatic evaluation metric that computes scores from multimodal inputs using a parallel feature extraction mechanism built on embeddings trained through large-scale contrastive learning. To train Polos, they introduce Multimodal Metric Learning from Human Feedback (M$^2$LHF), a framework for developing metrics based on human feedback, and construct the Polaris dataset, which contains 131K human judgments from 550 evaluators and is approximately ten times larger than standard datasets. Polos achieves state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset, demonstrating its effectiveness and robustness.
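As a rough illustration of what "correlation with human judgments" can mean in practice, the sketch below computes a rank correlation (Kendall's tau-a) between a metric's scores and human ratings for the same captions. This is not taken from the paper: the function name `kendall_tau_a`, the choice of tau-a, and the toy numbers are assumptions made for illustration only, and the paper's actual evaluation protocol may differ per benchmark.

```python
from itertools import combinations


def kendall_tau_a(metric_scores, human_scores):
    """Kendall's tau-a between two equally long lists of scores.

    Hypothetical helper for illustration; tied pairs count as neither
    concordant nor discordant.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two scored captions")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both scorers order it the same way.
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Toy data (made up): metric scores and human ratings for three captions.
metric = [0.1, 0.8, 0.4]
human = [1, 2, 3]
tau = kendall_tau_a(metric, human)
```

A value near 1 means the metric ranks captions in the same order as the human raters, a value near -1 means the opposite order, and a value near 0 means little agreement.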