The Solution for the CVPR2024 NICE Image Captioning Challenge

Longfei Huang, Shupeng Zhong, Xiangyu Wu, Ruoxuan Li
2024-04-29
Abstract: This report introduces a solution to Topic 1, Zero-shot Image Captioning, of the 2024 NICE challenge (New frontiers for zero-shot Image Captioning Evaluation). In contrast to the NICE 2023 datasets, this challenge involves new human annotations that differ significantly in caption style and content. We therefore enhance image captions through retrieval augmentation and caption grading. At the data level, we use high-quality captions generated by image captioning models as training data to close the gap in text style. At the model level, we employ OFA (a large-scale visual-language pre-training model based on handcrafted templates) to perform the image captioning task. We then propose a caption-level strategy for the high-quality caption data generated by the image captioning models and integrate it, together with a retrieval augmentation strategy, into the template, so that the model is compelled to generate higher-quality, better-matched, and semantically richer captions based on the retrieval augmentation prompts. Our approach achieves a CIDEr score of 234.11.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of generating high-quality image captions in the zero-shot image captioning task. Specifically, it focuses on how to produce text descriptions that match the image content and are of high quality without annotations for the specific dataset. The paper points out that most existing image captioning datasets are obtained through web crawling, which leads to inconsistent data quality and a style that differs from manually annotated data. It therefore proposes a method to improve caption quality, especially when facing new annotation styles and content. Through retrieval augmentation and caption grading strategies, the method improves the quality of the generated captions and how well they match the image. According to the paper, this not only improves the model's performance on the zero-shot image captioning task, but also addresses the error accumulation that may occur when model-generated captions are used as training data.
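To make the idea of combining retrieval augmentation with a caption level in a template more concrete, here is a minimal Python sketch. It is not the authors' implementation. The function names (`retrieve_captions`, `caption_level`, `build_prompt`), the cosine-similarity retrieval over toy feature vectors, the score thresholds, and the prompt wording are all assumptions made for illustration; the report does not specify them here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_captions(query_vec, bank, k=2):
    """Return the k captions whose (hypothetical) image features are closest to the query.

    `bank` is a list of (feature_vector, caption) pairs; in practice the features
    would come from some image encoder, which is assumed and not shown here.
    """
    ranked = sorted(bank, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [caption for _, caption in ranked[:k]]


def caption_level(score, thresholds=(0.3, 0.6)):
    """Bucket a caption quality score into a discrete level (0 = lowest).

    The thresholds are illustrative assumptions, not values from the report.
    """
    return sum(score >= t for t in thresholds)


def build_prompt(retrieved, level):
    """Fill a hypothetical template with retrieved captions and a quality level."""
    context = "; ".join(retrieved)
    return (
        f"Similar captions: {context}. "
        f"Caption level: {level}. "
        "What does the image describe?"
    )


if __name__ == "__main__":
    # Toy caption bank with 2-dimensional stand-in features.
    bank = [
        ([1.0, 0.0], "a dog on grass"),
        ([0.0, 1.0], "a city skyline"),
        ([0.9, 0.1], "a puppy running"),
    ]
    retrieved = retrieve_captions([1.0, 0.0], bank, k=2)
    print(build_prompt(retrieved, caption_level(0.7)))
```

At inference time, one would presumably request the highest level in the template so that the model is steered toward the style of the best-graded training captions; that usage is likewise an assumption based on the summary above.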