Cross-Domain Image Captioning with Discriminative Finetuning

Roberto Dessì, Michele Bevilacqua, Eleonora Gualdoni, Nathanael Carraz Rakotonirina, Francesca Franzon, Marco Baroni
2023-04-04
Abstract: Neural captioners are typically trained to mimic human-generated references without optimizing for any specific communication goal, leading to problems such as the generation of vague captions. In this paper, we show that fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language that is more informative about image contents. Given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify such image among a set of candidates. We experiment with the popular ClipCap captioner, also replicating the main results with BLIP. In terms of similarity to ground-truth human descriptions, the captions emerging from discriminative finetuning lag slightly behind those generated by the non-finetuned model, when the latter is trained and tested on the same caption dataset. However, when the model is used without further tuning to generate captions for out-of-domain datasets, our discriminatively-finetuned captioner generates descriptions that resemble human references more than those produced by the same captioner without finetuning. We further show that, on the Conceptual Captions dataset, discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
### What problem does this paper attempt to solve?

This paper addresses the vagueness of neural image captions: captioners trained to mimic human references, without any explicit communication goal, often produce descriptions that are not specific or informative enough. The paper proposes discriminative finetuning, which uses reinforcement learning to optimize a pre-trained image captioner so that its descriptions better distinguish the target image from other candidates.

### Main Contributions:

1. **Improvement in Cross-Domain Zero-Shot Image Captioning**:
   - When applied without further tuning to out-of-domain datasets, the discriminatively finetuned captioner generates descriptions that resemble human references more than those of the same captioner without finetuning. This is most relevant when the target domain lacks annotated data. In-domain, where the baseline is trained and tested on the same caption dataset, the finetuned captions lag slightly behind.
2. **Enhanced Image Retrieval Performance**:
   - The finetuned captions are optimized to let a neural text-conditioned image retriever identify the target image. On Conceptual Captions, they also help human annotators in an image discrimination task more than either vanilla ClipCap captions or the ground-truth human captions.

### Method Overview:

1. **Discriminative Self-Supervised Training**:
   - The pre-trained image captioner is finetuned through reinforcement learning so that the descriptions it generates help a frozen, out-of-the-box retriever identify the target image among a set of candidate images. No human-written captions are needed for this step.
2. **Experimental Validation**:
   - Experiments use the ClipCap captioner, with the main results replicated with BLIP, on multiple datasets including COCO and Conceptual Captions. They show improved similarity to human references in cross-domain zero-shot settings. A human study on Conceptual Captions found the discriminatively finetuned captions more helpful than the ground-truth captions.
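
The discriminative objective in the method overview can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the summary does not state the scoring function, so cosine similarity with a softmax over candidates is assumed; the REINFORCE-style surrogate loss and the names `discriminative_reward` and `reinforce_loss` are hypothetical; and the embeddings are toy 2-D vectors rather than outputs of a real retriever.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def discriminative_reward(caption_emb, image_embs, target_idx, temperature=1.0):
    """Log-probability that a frozen retriever picks the target image.

    Assumption: the retriever scores each candidate by cosine similarity to
    the caption embedding, and a softmax turns the scores into a distribution.
    """
    scores = [cosine(caption_emb, img) / temperature for img in image_embs]
    max_score = max(scores)  # subtract the max for numerical stability
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return scores[target_idx] - log_norm


def reinforce_loss(caption_log_prob, reward, baseline=0.0):
    """REINFORCE surrogate loss for one sampled caption (assumed estimator).

    Minimizing it raises the probability of captions whose reward exceeds
    the baseline and lowers it for captions below the baseline.
    """
    return -(reward - baseline) * caption_log_prob


# Toy candidate set: the target is image 0; images 1 and 2 are distractors.
image_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
specific_caption = [0.9, 0.1]  # embedding close to the target only
vague_caption = [0.5, 0.5]     # embedding equally close to images 0 and 1

r_specific = discriminative_reward(specific_caption, image_embs, target_idx=0)
r_vague = discriminative_reward(vague_caption, image_embs, target_idx=0)
print(r_specific > r_vague)
```

In this toy setup the specific caption earns the higher reward because the vague one matches a distractor equally well, which is the pressure toward informative captions that the summary describes. Only the captioner would receive gradients from such a loss; the retriever stays frozen.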