VLRM: Vision-Language Models act as Reward Models for Image Captioning

Maksim Dzabraev, Alexander Kunitsyn, Andrei Ivaniuta
2024-04-02
Abstract: In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning, with vision-language models such as CLIP and BLIP2-ITM serving as reward models. The RL-tuned model generates longer and more comprehensive descriptions. Our model reaches an impressive CLIP Recall (R@1) score of 0.90 on the MS-COCO Karpathy test split.
Computer Vision and Pattern Recognition
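The CLIP Recall R@1 figure quoted in the abstract can be read as a retrieval metric. The following is a minimal sketch, assuming retrieval runs from generated caption to image and that a similarity matrix between caption and image embeddings has already been computed (for example, with CLIP). The function name `recall_at_1` and the toy scores are hypothetical illustrations, not the authors' evaluation code or data.

```python
from typing import Sequence


def recall_at_1(similarity: Sequence[Sequence[float]]) -> float:
    """Fraction of captions whose top-ranked image is the matching one.

    Assumption: similarity[i][j] is the score between caption i and image j,
    and caption i was generated for image i (matches lie on the diagonal).
    """
    if not similarity:
        raise ValueError("similarity matrix must be non-empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Index of the highest-scoring image for caption i (first one on ties).
        best_j = max(range(len(row)), key=lambda j: row[j])
        if best_j == i:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 example with made-up scores: captions 0 and 1 retrieve their own
# image, caption 2 retrieves image 0 instead.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.6, 0.2, 0.5],
]
toy_score = recall_at_1(toy_similarity)
```

Under this reading, a score of 0.90 would mean that for 90% of test images the generated caption ranks its own image first among all candidates.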