Fluent and Accurate Image Captioning with a Self-Trained Reward Model

Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
2024-08-30
Abstract: Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. On the contrary, recent attempts to employ image-text models like CLIP as reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time in comparison to using the CIDEr score as the sole metric for optimization. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two key shortcomings of existing approaches to training image captioning models:

1. **Limitations of Traditional Reward Mechanisms**: Fine-tuning with hand-crafted rewards such as the CIDEr metric can improve caption quality at the sequence level, but it often limits the descriptiveness and semantic richness of the captions and pushes the model to imitate the style of ground-truth sentences, losing detail and specificity.
2. **Problems with CLIP-Based Rewards**: Recent attempts to use image-text models like CLIP as the reward can produce semantically rich sentences, but they often lead to grammatical errors and repetition and ignore correct word order.

To solve these problems, the authors propose a new method named Self-Cap. Self-Cap relies on a learnable reward model, trained with self-generated negative samples, that discriminates captions based on their consistency with the image. Specifically, the discriminator is a fine-tuned contrastive image-text model that aims to promote caption correctness while avoiding the aberrations common in training with a CLIP-based reward. In addition, directly incorporating negative samples from a frozen captioner significantly improves the quality and richness of the generated captions and reduces fine-tuning time compared with using the CIDEr score as the sole optimization metric.

### Main Contributions

- **Proposing a New Self-Trained Reward Mechanism**: Self-generated negative samples are used to improve the evaluation of consistency between captions and images, avoiding the limitations of traditional reward mechanisms.
- **Improving the Quality and Richness of Caption Generation**: Experimental results show that Self-Cap performs well on both standard and zero-shot image captioning datasets, and that the generated captions are both more in line with human judgment and more grammatically accurate.
### Formula Representation

Some of the formulas involved in the paper include:

- **Cross-Entropy Loss**:
  \[ L_{XE}(\theta) = -\sum_{t=1}^{T} \log\left(P(w_t \mid w_{1:t-1}, R)\right) \]
- **CLIP Similarity Calculation**:
  \[ S_{ij} = \text{sim}(T_i, V_j) \]
  where $\text{sim}(\cdot)$ represents cosine similarity.
- **SCST Expected Gradient**:
  \[ \nabla_\theta L_{SCST}(I_i, s'_i, \theta) = (D_r(I_i, s'_i) - b)\,\nabla_\theta \log f_\theta(s'_i) \]
  where $b$ is the baseline, used to reduce the variance of the gradient estimate.

Through these improvements, Self-Cap can generate richer and more natural image captions while maintaining grammatical accuracy.
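As a rough illustration of the similarity and SCST formulas above, the following Python sketch computes a cosine-similarity matrix $S_{ij}$ and the per-sample coefficient $(D_r - b)$ that scales the log-probability gradient. It is not the authors' implementation: the function names, the toy inputs, and the use of the batch-mean reward as the default baseline $b$ are assumptions made purely for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(text_embs, image_embs):
    """Return S with S[i][j] = sim(T_i, V_j) for text and image embeddings."""
    return [[cosine_similarity(t, v) for v in image_embs] for t in text_embs]


def scst_coefficients(rewards, baseline=None):
    """Return (D_r - b) for each sampled caption.

    `rewards` holds one reward-model score per sampled caption.
    Assumption: when no baseline is given, the batch-mean reward is used.
    """
    if baseline is None:
        baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
    texts = [[1.0, 0.0], [0.0, 1.0]]
    images = [[1.0, 0.0], [1.0, 1.0]]
    print(similarity_matrix(texts, images))

    # Hypothetical reward-model scores for three sampled captions.
    print(scst_coefficients([0.9, 0.5, 0.1]))
```

Captions scoring above the baseline receive a positive coefficient and those below receive a negative one, which is how subtracting $b$ centres the update without changing which captions are preferred.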